Preface



This volume contains the papers selected for presentation at the 1st Conference 
on Modeling Decisions for Artificial Intelligence (MDAI 2004), held in Barcelona, 
Catalonia, August 2-4, 2004. 

The aim of this conference was to provide a forum for researchers to discuss 
models for information fusion (aggregation operators) and decision, to examine 
computational methods and criteria for model selection and determination, and 
to stimulate their application in new contexts. 

Fifty-three papers were submitted to the conference, from 19 different coun- 
tries. Each submitted paper was reviewed by at least two experts on the basis 
of technical soundness, originality, significance and clarity. Based on the review 
reports, 26 papers were accepted for publication in this volume. Additionally, 
this volume contains the plenary talks given at the conference. 

We would like to express our gratitude to the members of the program
committee as well as to all reviewers for their work. We thank Alfred Hofmann,
from Springer-Verlag, who supported the publication of these proceedings in the
LNAI series.

The conference was supported by the Catalan Association for Artificial 
Intelligence (ACIA), the European Society for Fuzzy Logic and Technology 
(EUSFLAT), the Japan Society for Fuzzy Theory and Intelligent Informatics 
(SOFT), the IEEE Spanish Chapter, the Spanish Council for Scientific Research 
(CSIC) and the Generalitat de Catalunya (AGAUR 2002XT 00111). 
Introduction to RoboCup Research in Japan



Yoichiro Maeda 

Department of Human and Artificial Intelligent Systems 
Faculty of Engineering, Fukui University 
3-9-1 Bunkyo, Fukui 910-8507 Japan 
maeda@ir.his.fukui-u.ac.jp



What is RoboCup?

RoboCup (Robot World Cup Initiative) is the most famous soccer robot
competition in the world. However, RoboCup was originally established as an
international joint project to promote AI, robotics, and related fields. To
pursue this aim, soccer was selected as the primary domain of RoboCup, and
soccer competitions and international conferences have been organized at
different places around the world every year since 1997 [1]-[6]. Currently,
about 35 countries and 3,000 researchers participate in the RoboCup project.
The final goal of the RoboCup project is to develop, by 2050, a team of fully
autonomous humanoid robot soccer players that can win against the human World
Cup champion team under the official FIFA rules.

The first idea of soccer robots was proposed by Prof. Alan Mackworth
(University of British Columbia, Canada) in his paper "On Seeing Robots" in
1992 [7]. The Dynamo robot soccer project was established by his group. In
Japan, too, researchers organized a Workshop on Grand Challenges in Artificial
Intelligence in October 1992 to discuss possible grand challenge problems.
Furthermore, a group of researchers including Minoru Asada, Yasuo Kuniyoshi,
and Hiroaki Kitano decided in June 1993 to launch a robotic competition,
tentatively named the Robot J-League. They later renamed the project the Robot
World Cup Initiative, "RoboCup".



Competitions and Conferences 

The first official RoboCup competition and conference were held at IJCAI-97
in Nagoya in 1997. Over 40 teams participated, and over 5,000 spectators
attended. Since 1997, RoboCup has been held in several places in different
countries, as shown in Table 1. The RoboCup competitions and conferences
marked with an asterisk in Table 1 were held at the same time as the FIFA
World Cup.

This year, RoboCup 2004 was held in Lisbon, Portugal. RoboCup 2005 will be
held in Osaka, Japan, and RoboCup 2006 in Germany, synchronized with the FIFA
World Cup 2006. Regional competitions and workshops related to RoboCup (see
Table 2) have also been held actively in various countries. In Japan, a
RoboCup pre-competition (called the Japan Open) has been held every year
since 1998.
Table 1. World Championship Competitions and Conferences

Competition       Venue                              Participants
RoboCup 97        Nagoya (Japan)                     10 Countries / 40 Teams
RoboCup 98 *      Paris (France)                     20 Countries / 63 Teams
RoboCup 99        Stockholm (Sweden)                 35 Countries / 120 Teams
RoboCup 2000      Melbourne (Australia)              19 Countries / 110 Teams
RoboCup 2001      Seattle (U.S.A.)                   22 Countries / 119 Teams
RoboCup 2002 *    Fukuoka (Japan) / Busan (Korea)    29 Countries / 188 Teams
RoboCup 2003      Padua (Italy)                      34 Countries / 277 Teams
RoboCup 2004      Lisbon (Portugal)                  30 Countries / 265 Teams
RoboCup 2005      Osaka (Japan)                      (Planned Schedule)
RoboCup 2006 *    Germany                            (Planned Schedule)

* held at the same time as the FIFA World Cup



Table 2. Regional Competitions and Workshops

Year    Event                                                         Country
[1998]  RoboCup Pacific Rim Series 98 Singapore                       Singapore
        RoboCup-98 IROS Series at Victoria                            Canada
        VISION RoboCup 98 at Germany                                  Germany
        RoboCup Japan Open 98 Tokyo                                   Japan
        RoboCup Simulator League Exhibition at Autonomous Agent 98    U.S.A.
        AAAI-98 Mobile Robot Competition and Exhibition               U.S.A.
[1999]  RoboCup Japan Open 99 Nagoya                                  Japan
[2000]  RoboCup Euro 2000 Amsterdam                                   Netherlands
        RoboCup Japan Open 2000 Hakodate                              Japan
[2001]  RoboCup German Open 2001                                      Germany
        RoboCup Japan Open 2001 Fukuoka                               Japan
[2002]  RoboCup German Open 2002                                      Germany
        RoboCup Japan Spring Games 2002 Tokyo                         Japan
[2003]  RoboCup Japan Open 2003 Niigata                               Japan
        RoboCup American Open 2003                                    U.S.A.
        RoboCup German Open 2003                                      Germany
        RoboCup Australian Open 2003                                  Australia
[2004]  RoboCup Japan Open 2004 Osaka                                 Japan
        RoboCup American Open 2004                                    U.S.A.
        RoboCup German Open 2004                                      Germany
Main Domains

The RoboCup project is organized into three main domains. The RoboCup
International Symposium is also held in conjunction with the soccer
competitions as the core meeting for the presentation of scientific
contributions in areas of relevance to RoboCup.

— RoboCupSoccer: International Robot World Cup Initiative of Soccer Game 
Competition by Computer Simulation and Real Robots 

• Simulation League 

• Small Size Robot League (f-180) 

• Middle Size Robot League (f-2000) 

• Four-Legged Robot League (Supported by Sony) 

• Humanoid League (Since 2002) 

— RoboCupRescue: Rescue Application in Large Scale Disasters by Technolo- 
gies Developed through RoboCup Soccer 

• Rescue Simulation League 

• Rescue Robot League 

— RoboCup Junior: Project-Oriented Educational Initiative of Regional and In- 
ternational Robotic Events for Young Students 

• Soccer Challenge 

• Dance Challenge 

• Rescue Challenge 

Research Subjects 

RoboCup is a landmark project for fostering AI and intelligent robotics
research. Technologies developed in RoboCup can be applied to socially
significant problems and to industry. For example, realizing an actual soccer
robot requires various technologies, including the research elements listed
below. RoboCup is therefore a very attractive research area for AI and
robotics researchers.

— High performance locomotive mechanism

— Adaptive behavior selection in dynamic environment 

— Real-time reasoning and learning 

— Strategy acquisition for team play 

— Cooperative behavior in multi-agent robot 

— Design methodology of autonomous agents 

— Object recognition by sensor-fusion 

— Self-localization method from sensing information 

— Communication between agents by wireless-LAN system etc. 

However, I regret that some researchers criticize soccer robot research
because they think it is only a game. I think they do not understand the
academic significance of RoboCup and the variety of research subjects it
covers. I would like many AI and robotics researchers to understand the
usefulness of RoboCup research.
RoboCup Research in Japan

Finally, I will briefly introduce the latest Japan Open, held this year.
Japan Open 2004 had 22 teams in the Simulation League, 10 teams in the Small
Size Robot League, 8 teams in the Middle Size Robot League, 9 teams in the
Four-Legged Robot League, 9 teams in the Humanoid League, 3 teams in the
Rescue Simulation League, 10 teams in the Rescue Robot League, 38 teams in the
Soccer Challenge, 6 teams in the Dance Challenge and 9 teams in the Rescue
Challenge. Three days of preliminary matches and one day of final matches
were held in Osaka (see Fig. 1).

In Japan, Humanoid League robots have been actively developed recently,
because this league only started at RoboCup 2002 in Fukuoka. Many Japanese
teams in every RoboCup league continue to pursue ambitious research on new
robot mechanisms, intelligent control methods, adaptive decision making,
high-speed vision systems and so on. For example, in the Middle Size Robot
League, soft computing methods such as fuzzy reasoning, neural networks,
genetic algorithms and reinforcement learning are increasingly being used in
real robots, as shown in the following research.

Team EIGEN (Keio Univ.) 

— Motion control based on fuzzy potential method with omni-directional vision 
system [8] 

— Neural network controller with weighted values tuned by genetic algo- 
rithms [9] 

Team Trackies (Osaka Univ.) 

— Behavior acquisition by vision-based reinforcement learning [10] 

— Multi-controller fusion in multi-layered reinforcement learning [11] 

— Behavior generation for a mobile robot based on the adaptive fitness func- 
tion [12] 

Team KIRC (Kyushu Inst. of Tech.)

— Extended Q-learning method using self-organized state space based on be- 
havior value function [13] 



Conclusions 

A brief summary of RoboCup competitions and conferences was given in the
earlier sections of this paper, and the latest RoboCup competition in Japan
was reported in the last section. As I could not explain in detail the recent
research targets for RoboCup and the intelligent control methods using soft
computing in this paper, some of them will be introduced in the plenary talk
of this conference. Detailed information on RoboCup competitions and
conferences can be found at the RoboCup Official Site [14].
Fig. 1. Scenery of RoboCup Japan Open 2004 in Osaka (recoverable panel
titles: c) Small Size Robot League, d) Middle Size Robot League,
g) Four-Legged Robot League, h) Humanoid League)
Abstract. We give an overview of fuzzy integrals, including historical
remarks. The Choquet integral can be traced back to 1925. The Sugeno integral
has a predecessor in the Shilkret integral from 1971. Some other fuzzy
integrals and the corresponding discrete integrals are also given. An
application of the Choquet integral to additive impreciseness measuring of
fuzzy quantities, with interesting consequences for fuzzy measures, is
presented. Finally, recent developments and trends in the theory of fuzzy
integrals are mentioned.



1 Introduction 

The history of integration began with the ($\sigma$-additive) measure on the
real line / plane assigning to intervals their length / to rectangles their
area. This measure is now known as the Lebesgue measure on Borel subsets of
$\mathbb{R}$ / $\mathbb{R}^2$, and the first results in this direction were
obtained by the ancient Greeks. Later generalizations to abstract spaces,
still based on a ($\sigma$-additive) measure, led to the Lebesgue integral and
several related integrals with either special domains (Banach spaces, for
example) or special types of additivity (semigroup-valued measures,
pseudo-additive measures). For an exhaustive overview we recommend the recent
handbook [26].

All the above-mentioned types of integrals are based on a partition-based
representation of simple functions (i.e., functions with finite range) and on
the composition property of measures (i.e., the measure of a union
$A \cup B$ of disjoint events depends only on the measure of $A$ and the
measure of $B$).

However, more than 70 years ago, scientists became more and more interested
in monotone set functions without a composition property of any type. Recall
here, for example, submeasures, supermeasures, sub- (or super-) modular
measures, etc. Note also that the first trace of an integral with respect to
such set functions goes back to Vitali in 1925 [33], and his integral was
independently proposed by Choquet in 1953 [9] (and it now bears Choquet's
name). In different branches of mathematics, there are different names for
the same object of our interest. Monotone set functions vanishing on the empty
set (and defined on a $\sigma$-algebra, but also on a paving only) are called
premeasures [29], capacities (in the original [9] with additional
requirements, now often abandoned), monotone games, or monotone measures.
Throughout this contribution, we will use the name fuzzy measure, though again
its original definition by Sugeno [30] was more restrictive.

Definition 1. Let $(X, \mathcal{A})$ be a measurable space, i.e.,
$X \neq \emptyset$ is a universe and $\mathcal{A} \subseteq \mathcal{P}(X)$
a $\sigma$-algebra. A mapping $m : \mathcal{A} \to [0, 1]$ is called a fuzzy
measure if

(i) $m(\emptyset) = 0$ and $m(X) = 1$    (boundary condition)

and

(ii) $A \subseteq B \subseteq X \Rightarrow m(A) \le m(B)$    (monotonicity)

are fulfilled.

Note also that in fuzzy measure theory, the range of $m$ is sometimes
allowed to be $[0, \infty]$, without forcing the normalization condition
$m(X) = 1$. On infinite spaces, some types of continuity are also often
required, especially left-continuity (lower semi-continuity), i.e.,

(iii) $A_n \nearrow A \Rightarrow m(A_n) \nearrow m(A)$.

If necessary, we will indicate that (iii) is required.

Our aim is to discuss so-called fuzzy integrals, i.e., integrals based on
fuzzy measures. In general, the (measurable) functions to be integrated are
$X \to [0, 1]$ mappings, and thus they can be treated as fuzzy subsets of $X$.
In such a case, the fuzzy integral $I$ is always supposed to be a monotone
extension of the underlying fuzzy measure $m$ (acting on crisp subsets of $X$)
which acts on (measurable) fuzzy subsets of $X$.

2 Basic Fuzzy Integrals

Following Zadeh [36], we will call $\mathcal{A}$-measurable fuzzy subsets of
$X$ fuzzy events. Note also that Zadeh [36] defined a fuzzy probability
measure $M$ for fuzzy events by

$M(f) = \int_X f \, dP,$

where $P$ is a fixed probability measure on $(X, \mathcal{A})$ and $M(f)$ is
the standard Lebesgue integral of $f$ with respect to $P$.

Obviously, $M$ extends $P$ from crisp subsets of $X$ to fuzzy events on $X$.
However, the requirement that $P$ is a probability measure restricts Zadeh's
proposal considerably. As already mentioned, the first known approach to
fuzzy integrals can be found in [33], but also (and independently) in [9]
and [29], and this integral is now known as the Choquet integral.
Definition 2. Let $m$ be a fuzzy measure on $(X, \mathcal{A})$. The Choquet
integral $C_m(f)$ of a fuzzy event $f$ with respect to $m$ is given by

$C_m(f) = \int_0^1 m(f \ge x) \, dx,$    (1)

where the right-hand side of (1) is the Riemann integral.

Note that the Choquet integral is well defined for any fuzzy measure (with
no continuity requirement). However, if $m$ is left-continuous then $C_m$ is
also a left-continuous functional. Observe that, apart from monotonicity and
the boundary conditions $C_m(0) = 0$, $C_m(1) = 1$ (i.e., $C_m$ on a finite
$X$ with $n$ elements is an $n$-ary aggregation operator [21], see also
[6, 20]), a genuine property characterizing the Choquet integral is
comonotone additivity, i.e.,

$C_m(f + g) = C_m(f) + C_m(g)$    (2)

whenever $f$, $g$, $f + g$ are fuzzy events and $f$ is comonotone to $g$,
i.e., $(f(x) - f(y))(g(x) - g(y)) \ge 0$ for all $x, y \in X$. This property
was another source for introducing the Choquet integral: the only monotone
functional $I$ (on events) which is comonotone additive and fulfills the
boundary conditions is exactly the Choquet integral $C_m$ with
$m(A) = I(1_A)$, see [27].
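As a small worked illustration of (1) and (2), the following sketch evaluates the Choquet integral by hand on a two-element universe. The universe $X = \{1, 2\}$, the measure values and the vectors $f$, $g$, $h$ are assumptions chosen only for this example; they do not come from the text.

```latex
% Illustrative example; X = {1,2} and all numerical values are assumed.
% For f = (f_1, f_2) with f_1 \le f_2 the level sets are
% \{f \ge x\} = X for x \le f_1, \{2\} for f_1 < x \le f_2, and empty otherwise,
% so (1) gives
\[
  C_m(f) = \int_0^{f_1} m(X)\,dx + \int_{f_1}^{f_2} m(\{2\})\,dx
         = f_1 + (f_2 - f_1)\, m(\{2\}).
\]
% Take m(\{1\}) = 0.3, m(\{2\}) = 0.5, f = (0.2, 0.6), g = (0.1, 0.3).
% f and g are comonotone, and (2) holds:
\[
  C_m(f) = 0.2 + 0.4 \cdot 0.5 = 0.4, \qquad
  C_m(g) = 0.1 + 0.2 \cdot 0.5 = 0.2, \qquad
  C_m(f + g) = 0.3 + 0.6 \cdot 0.5 = 0.6 = C_m(f) + C_m(g).
\]
% For h = (0.3, 0.1), which is not comonotone to f, additivity fails:
\[
  C_m(h) = 0.1 + 0.2 \cdot 0.3 = 0.16, \qquad
  C_m(f + h) = 0.5 + 0.2 \cdot 0.5 = 0.6 \neq 0.56 = C_m(f) + C_m(h).
\]
```

The failure in the last line comes from the ordering of the coordinates: $h$ is integrated along the chain $X \supseteq \{1\}$, while $f$ and $f + h$ use the chain $X \supseteq \{2\}$.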

Observe that comonotone additivity (with monotonicity) also ensures
homogeneity, $I(cf) = cI(f)$ for any real $c$ such that $f$ and $cf$ are
fuzzy events. Finally, note that comonotone additivity (in the representation
of functionals) can be replaced by homogeneity and horizontal additivity

$I(f) = I(f \wedge a) + I(f - f \wedge a),$

$a \in [0, 1]$, $f$ a fuzzy event. Thus, the Choquet integral is also
(positively) homogeneous and horizontally additive. For more details we
recommend [10, 2].

Note also that for $\sigma$-additive fuzzy measures, i.e., for probability
measures, the Choquet integral coincides with the usual Lebesgue integral.
Moreover, for an $\infty$-monotone fuzzy measure $m$, i.e., a belief measure
[25, 34], we have $C_m(f) = \inf\{C_P(f) \mid P \ge m\}$. Similarly, for duals
to belief measures, $m^d(A) = 1 - m(A^c)$, i.e., for plausibility measures
($\infty$-alternating fuzzy measures, [34, 25]), we have
$C_{m^d}(f) = \sup\{C_P(f) \mid P \le m^d\}$.

The Sugeno integral was introduced in [30].

Definition 3. Let $m$ be a fuzzy measure on $(X, \mathcal{A})$. The Sugeno
integral $S_m(f)$ of a fuzzy event $f$ with respect to $m$ is given by

$S_m(f) = \bigvee_{x=0}^{1} \min(x, m(f \ge x)).$    (3)
Note that if $m$ is a possibility measure on $X$ [37, 11] induced by a
possibility density $\varphi : X \to [0, 1]$,
$m(A) = \sup\{\varphi(x) \mid x \in A\}$, then (3) can be rewritten as

$S_m(f) = \bigvee_{x \in X} \min(\varphi(x), f(x)).$    (4)

Similarly to the case of the Choquet integral, the Sugeno integral represents
functionals which are monotone, satisfy $I(1) = 1$, $I(0) = 0$, and are
comonotone maxitive (i.e., $I(f \vee g) = I(f) \vee I(g)$ whenever $f$ and $g$
are comonotone) and min-homogeneous, i.e., $I(a \wedge f) = a \wedge I(f)$ for
all $a \in [0, 1]$. For a full proof we recommend the overview [2]. Note that
comonotone maxitivity in the above claim can be replaced by the weaker
condition of horizontal maxitivity $I(f) = I(f \wedge a) \vee I(f_a)$,
$a \in [0, 1]$, where

$f_a(x) = \begin{cases} f(x) & \text{if } f(x) > a, \\ 0 & \text{else,} \end{cases}$

or even by max-homogeneity $I(a \vee f) = a \vee I(f)$, see [2, 1, 22].
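The agreement between (3) and (4) for a possibility measure can be checked numerically on a small finite universe. The sketch below is only an illustration: the three-point universe (indexed from 0), the density `phi` and the fuzzy event `f` are assumed values. It relies on the fact that on a finite universe the supremum in (3) is attained at one of the values of $f$, because $m(f \ge x)$ is a non-increasing step function of $x$.

```python
from itertools import combinations


def sugeno(f, m):
    """Sugeno integral (3) on a finite universe; the sup is attained at values of f."""
    return max(
        min(t, m[frozenset(j for j, v in enumerate(f) if v >= t)])
        for t in f
    )


# Assumed example data: a possibility density on a 3-point universe {0, 1, 2}.
phi = (0.3, 1.0, 0.6)
f = (0.8, 0.4, 0.7)

# Possibility measure induced by phi: m(A) = max of phi over A, with m(empty set) = 0.
subsets = [frozenset(c) for r in range(4) for c in combinations(range(3), r)]
m = {A: max((phi[j] for j in A), default=0.0) for A in subsets}

via_level_sets = sugeno(f, m)                           # formula (3)
via_density = max(min(p, v) for p, v in zip(phi, f))    # formula (4)
print(via_level_sets, via_density)
```

Both expressions evaluate to the same number for this data, and formula (4) needs only the density, not the full measure.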



3 Some Other Fuzzy Integrals 

In 1971, Shilkret [28] introduced an integral $Sh_m(f)$ with respect to a
maxitive fuzzy measure $m$ (i.e., $m(A \cup B) = m(A) \vee m(B)$ for any
events $A$, $B$), which can be straightforwardly extended to any fuzzy
measure $m$,

$Sh_m(f) = \bigvee_{x=0}^{1} (x \cdot m(f \ge x)).$

Similarly, Weber [35] proposed a generalization of the Sugeno integral. For
any t-norm $T$ with no zero divisors (for more details see [18]), the Weber
integral is given by

$W_{T,m}(f) = \bigvee_{x=0}^{1} T(x, m(f \ge x)).$

Note that $W_{\min,m} = S_m$ and that $W_{T_P,m} = Sh_m$, where $T_P$ is the
product t-norm. Alternative approaches were also presented in [23, 17, 19].

Several more complicated and even peculiar fuzzy integrals are discussed 
in [2]. Appropriate arithmetical operations for these fuzzy integrals are investi- 
gated in [3]. 

4 Discrete Fuzzy Integrals 

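In the case when $X$ is finite we will use the convention
$X = \{1, 2, \dots, n\}$, $n \in \mathbb{N}$, and
$\mathcal{A} = \mathcal{P}(X)$. Then fuzzy events $f$ are, in fact,
$n$-dimensional vectors $f = (f_1, \dots, f_n) \in [0, 1]^n$.

Fuzzy integrals can be expressed in such a case by several equivalent
discrete formulas. Some of them are based on the Möbius transform [13]
(namely $C_m$) or on the possibilistic Möbius transform [13] (namely $S_m$).
Denote by $f' = (f'_1, \dots, f'_n)$ a non-decreasing permutation of $f$. Then
(with the convention $f'_0 = 0$ and $f'_{n+1} = 1$; the level set
$\{j \mid f_j \ge f'_{n+1}\}$ in the last term of the first sum is taken to be
empty)

$C_m(f) = \sum_{i=1}^{n} f'_i \cdot \bigl( m(\{j \mid f_j \ge f'_i\}) - m(\{j \mid f_j \ge f'_{i+1}\}) \bigr)$
$\phantom{C_m(f)} = \sum_{i=1}^{n} (f'_i - f'_{i-1}) \cdot m(\{j \mid f_j \ge f'_i\})$

and

$S_m(f) = \bigvee_{i=1}^{n} \min(f'_i, m(\{j \mid f_j \ge f'_i\})),$

$Sh_m(f) = \bigvee_{i=1}^{n} f'_i \cdot m(\{j \mid f_j \ge f'_i\}),$

$W_{T,m}(f) = \bigvee_{i=1}^{n} T(f'_i, m(\{j \mid f_j \ge f'_i\})).$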
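The discrete formulas above translate almost directly into code. The following is a minimal sketch under stated assumptions, not a reference implementation: the universe is indexed from 0 instead of 1, a fuzzy measure is stored as a dictionary from frozensets of indices to values, and the helper names (`all_subsets`, `level_set`, `choquet`, `weber`, `sugeno`, `shilkret`) as well as the example measure are hypothetical. The Choquet integral is computed in the difference form $\sum_i (f'_i - f'_{i-1}) \, m(\{j \mid f_j \ge f'_i\})$, which needs only the convention $f'_0 = 0$.

```python
from itertools import combinations


def all_subsets(n):
    """All subsets of the index set {0, ..., n-1} as frozensets."""
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]


def level_set(f, t):
    """Indices j with f[j] >= t."""
    return frozenset(j for j, v in enumerate(f) if v >= t)


def choquet(f, m):
    """Sum over sorted values of (f'_i - f'_{i-1}) * m({j | f_j >= f'_i})."""
    total, previous = 0.0, 0.0
    for t in sorted(f):
        total += (t - previous) * m[level_set(f, t)]
        previous = t
    return total


def weber(f, m, t_norm):
    """Maximum over i of T(f'_i, m({j | f_j >= f'_i}))."""
    return max(t_norm(t, m[level_set(f, t)]) for t in f)


def sugeno(f, m):
    """Weber integral with T = min."""
    return weber(f, m, min)


def shilkret(f, m):
    """Weber integral with the product t-norm."""
    return weber(f, m, lambda a, b: a * b)


# Assumed example: a symmetric (cardinality-based) fuzzy measure on 3 points.
measure = {A: (len(A) / 3) ** 2 for A in all_subsets(3)}
f = (0.2, 0.5, 0.9)
print(choquet(f, measure), sugeno(f, measure), shilkret(f, measure))
```

For this assumed data the level sets are the whole universe, a two-element set and a singleton, with measure values 1, 4/9 and 1/9, so the three integrals come out at roughly 0.378, 0.444 and 0.222. With an additive measure, `choquet` reduces to the ordinary weighted mean, in line with the remark that the Choquet integral coincides with the Lebesgue integral for probability measures.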

5 Two Applications of Choquet Integral 

$n$-dimensional fuzzy quantities are normal fuzzy subsets of $\mathbb{R}^n$
such that each $\alpha$-cut, $\alpha \in \, ]0, 1]$, is a convex compact
subset of $\mathbb{R}^n$. Typical examples are triangular or trapezoidal fuzzy
numbers for $n = 1$, and conic or pyramidal fuzzy numbers for $n = 2$.

For any fuzzy measure $m$ on Borel subsets of $\mathbb{R}^n$ (not normalized,
in general) the Choquet integral $I_m(f) = C_m(f)$ of a fuzzy quantity $f$ is
well defined and it can be understood as the impreciseness of $f$. For the sum
$h = f \oplus g$ of two fuzzy quantities defined by means of the Zadeh
extension principle (or, equivalently, by means of the sums of corresponding
$\alpha$-cuts), we expect the additivity of the impreciseness measure achieved
by means of the Choquet integral, i.e.,

$I_m(f \oplus g) = I_m(f) + I_m(g).$

Evidently, not every fuzzy measure $m$ ensures the desired result. A complete
description of appropriate fuzzy measures (i.e., fuzzy measures additive in
argument, $m(A + B) = m(A) + m(B)$) was recently given in [5].

In the case $n = 1$, the only convenient $m$ is (up to a multiplicative
constant) the standard Lebesgue measure $m = \lambda$, i.e., for compact
convex subsets of $\mathbb{R}$ the corresponding length (each such subset is a
closed subinterval of $\mathbb{R}$). Moreover, $I_\lambda(f)$ is exactly the
area of the surface bounded by $f$ and the real axis. Already for $n = 2$,
there are several types of appropriate measure $m$.
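For $n = 1$ the additivity of the impreciseness can be verified by hand on triangular fuzzy numbers. The derivation below is an illustrative sketch: the parametrization of a triangular fuzzy number by a triple $(a, b, c)$ with $a \le b \le c$ (support $[a, c]$, peak at $b$) is an assumption introduced for this example.

```latex
% Illustrative derivation; the triple (a, b, c) notation is assumed.
% The \alpha-cut of a triangular fuzzy number f = (a, b, c) is the interval
\[
  f_\alpha = [\, a + \alpha (b - a),\; c - \alpha (c - b) \,], \qquad
  \lambda(f_\alpha) = (c - a)(1 - \alpha),
\]
% so the Choquet integral with respect to the Lebesgue measure is
\[
  I_\lambda(f) = \int_0^1 \lambda(f_\alpha)\, d\alpha
               = (c - a) \int_0^1 (1 - \alpha)\, d\alpha = \frac{c - a}{2},
\]
% the area of a triangle with base c - a and height 1.
% Sums of \alpha-cuts give f \oplus g = (a_1 + a_2,\, b_1 + b_2,\, c_1 + c_2), hence
\[
  I_\lambda(f \oplus g) = \frac{(c_1 + c_2) - (a_1 + a_2)}{2}
                        = \frac{c_1 - a_1}{2} + \frac{c_2 - a_2}{2}
                        = I_\lambda(f) + I_\lambda(g).
\]
```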

Note that any (even infinite) convex combination
$m = \sum_{i=1}^{k} w_i m_i$ of fitting fuzzy measures $m_i$ yields a fitting
fuzzy measure $m$ (this follows from the additivity in measure of the Choquet
integral). Two typical examples are:

(i) a fuzzy measure $m$ assigning to each compact convex subset of
$\mathbb{R}^2$ its perimeter (note that the additivity of perimeters of
compact convex subsets of $\mathbb{R}^2$ was already observed by Cauchy); in
this case $I_m(f)$ is the area of the surface of the graph of the function
$f$ (lateral surface);
(ii) for an arbitrary fixed angle $\varphi \in [0, \pi[$, a fuzzy measure
$m_\varphi$ assigning to each compact convex subset of $\mathbb{R}^2$ the
length of its projection onto any straight line with slope $\varphi$; in this
case $I_{m_\varphi}(f)$ is the area of the projection of the body bounded by
$f$ and the base plane onto any plane perpendicular to the plane with slope
$\varphi$.

Note that one can expect the volume $V(f)$ of the body bounded by $f$ and the
base plane to be an appropriate impreciseness measure. However, $V$ is not an
additive function. For example, $V(f \oplus f) = 4V(f)$.
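The factor 4 can be traced back to the scaling of areas. The short derivation below is a sketch that writes the volume as the Choquet integral with respect to the two-dimensional Lebesgue measure; the symbol $\lambda_2$ for that measure is notation introduced only for this example.

```latex
% Sketch; \lambda_2 denotes the two-dimensional Lebesgue measure (assumed notation).
% The volume under f is obtained by integrating the areas of its \alpha-cuts:
\[
  V(f) = \int_0^1 \lambda_2(f_\alpha)\, d\alpha .
\]
% Each \alpha-cut is convex, so (f \oplus f)_\alpha = f_\alpha + f_\alpha = 2 f_\alpha,
% and areas scale quadratically, \lambda_2(2A) = 4 \lambda_2(A). Therefore
\[
  V(f \oplus f) = \int_0^1 \lambda_2(2 f_\alpha)\, d\alpha
                = 4 \int_0^1 \lambda_2(f_\alpha)\, d\alpha = 4 V(f),
\]
% whereas additivity would require V(f \oplus f) = 2 V(f).
```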

Another interesting application of the Choquet integral is linked to an older
result of Tarski [31] showing the existence of an additive (but not
$\sigma$-additive) fuzzy measure $m$ on an infinite universe such that $m$
vanishes on singletons (i.e., $m(A) = 0$ for all finite subsets
$A \subset X$). Applying the symmetric Choquet integral [29] with respect to
Tarski's fuzzy measure $m$ yields a linear operator on (real) measurable
functions which is invariant under finitely many changes of the functions to
be evaluated. For more details we recommend [4].

6 Concluding Remarks 

The importance of fuzzy integrals for applications lies in their capability
to express the possible interaction among single parts of our universe $X$
when a function (describing the real behaviour of an observed system) is
globally represented by means of a single value. This phenomenon cannot be
captured by the standard Lebesgue integral, though that integral always plays
a prominent role in the discussed framework as well. The expected properties
of our functionals, which extend an evaluation of crisp events (i.e., fuzzy
measures) to an evaluation of fuzzy events (i.e., fuzzy integrals), determine
our choice of an appropriate fuzzy integral.

Two prominent fuzzy integrals, the Choquet and Sugeno integrals, are linked
to two different types of arithmetical operations on $[0, 1]$ (or, in a more
general form, on $[0, \infty]$, or even on $[-\infty, \infty]$), namely to the
Archimedean operations of (truncated) summation and multiplication in the
first case, and to the idempotent operations max (sup) and min in the second
case. Several attempts to connect these two types of fuzzy integrals (or,
equivalently, of arithmetical operations) have been published so far, see
e.g. [8, 23]. Compare also [17, 19] and [3].

A recent interesting approach integrating both the Choquet and Sugeno
integrals into a single functional can be found in [32, 24], compare also [7].
Several other generalizations or extensions can be found, among others, in
[29, 14, 10, 25, 15, 16].
Abstract. The problem of finding quality information and services on
the Web is analyzed. We present two user-centered evaluation methodologies
to characterize the quality of Web documents and of the Web sites
that contain these Web documents. These evaluation methodologies are
designed using a fuzzy linguistic approach in order to facilitate the
expression of qualitative and subjective judgements. These methodologies
make it possible to obtain quality evaluations or recommendations on the
accessed Web documents/sites from linguistic judgements provided by Web
visitors. These recommendations can then help other visitors (information
or service searchers) to decide which Web resources to access, that is, to
find quality information and services on the Web.

Keywords: Web documents, Web services, quality evaluation, fuzzy lin- 
guistic modelling, XML 



1 Introduction 

Nowadays, we can assert that the Web is the largest available repository of
data, with the largest number of visitors searching for information.
Furthermore, because the Internet has become easily accessible to millions of
people around the world, a vast range of Web services has emerged for the most
diverse application domains, e.g., business, education, industry, and
entertainment. Therefore, we can also affirm that the Web is an infrastructure
on which many different applications or services (such as e-commerce or search
engines) are available. In fact, in the last few years the Web has witnessed
an exponential growth of both information and services [14, 12]. These Web
challenges generate new research issues, among which we can cite
[3, 12, 7, 28]: to identify Web information and services of good quality, to
improve the query language of search engines, and to develop the Semantic Web.
In this paper, we focus on the first one, and in particular we address the
problem of how to evaluate the quality of both the Web documents that store
information and the Web sites that provide services, in order to help users
decide on the best Web resources to use.

There is a large debate on the quality of the resources available on the
Web [2, 26]. How to recognize useful, quality resources in an unregulated
marketplace such as the Internet is becoming a serious problem in diverse
domains such as Medicine [4, 15, 6, 20], Organizations [17, 23, 33],
Government [24], Education [27] or Law [21]. However, there is not yet, in our
opinion, a clear-cut definition of the concept of quality. The ISO defines
quality as “the totality of characteristics of an entity that bear on its
ability to satisfy stated and implied needs” [18]. Web document and Web site
quality evaluation is neither simple nor straightforward. Web quality is a
complex concept and its evaluation is expected to be multi-dimensional in
nature. Two different kinds of requirements for Web document and Web site
quality evaluation emerge from the above definition:

1. Design and technical requirements. These imply the general evaluation of all 
the characteristics of Web documents/sites. In this category we find eval- 
uation criteria that are indicators of an objective and quantitative nature, 
e.g., clear ordering of information, broken links, orphan pages, code quality, 
navigation, etc. 

2. Informative content requirements. These imply the evaluation of how well 
the Web documents/sites satisfy the specific user needs. In this category we 
find evaluation criteria that are indicators of a subjective and qualitative 
nature, e.g., consistency, accuracy, relevance, etc. 

A robust and flexible Web quality evaluation methodology should properly
combine both kinds of requirements. However, although some authors [17, 23]
have proposed Web quality evaluation methodologies which combine both
informative and technical design aspects, the majority of suggested Web
evaluation methodologies tend to be more objective than subjective,
quantitative rather than qualitative, and do not take into account the user's
perception [5, 22]. An additional drawback of these Web evaluation
methodologies is that their evaluation indicators are relevant to Web
providers and designers rather than to Web users [1].

A global Web quality evaluation methodology cannot entirely avoid users'
participation in the evaluation strategy. User judgements can help to evaluate
the quality of accessed Web documents/sites. The problem here is that users
do not frequently make the effort to give explicit feedback. Web search
engines can collect implicit user feedback using log files. However, this data
is still incomplete. To achieve better evaluation results on the Web, the
direct participation of the user is necessary, i.e., a user-centered Web
quality evaluation methodology is a necessity. For example, using a
user-centered approach to evaluate Web sites would mean that users are
approached more pro-actively to determine their needs (both technical and in
terms of information) and their perceptions of Web site organization,
terminology, ease of navigation, etc., which could be used in a redesign of
the site [17].

One possible way to facilitate that user participation is to embed in the Web
quality evaluation methodology those tools of Artificial Intelligence that
allow a better representation of subjective and qualitative user judgements,
for example, fuzzy linguistic modelling [31]. The use of fuzzy linguistic
modelling to help users express their judgements could increase their
participation in the evaluation of the quality of Web documents/sites.
The aim of this paper is to present some models, based on fuzzy linguistic
tools, to evaluate the informative quality of Web resources. In particular,
we present two fuzzy models, one to evaluate the informative quality of Web
documents and another to evaluate the informative quality of the Web sites
used to publish those Web documents. The evaluation scheme of both models
takes into account both technical criteria and informative criteria, but both
quality evaluation models are of a qualitative and subjective nature because:

— Their underlying evaluation strategies or schemata are user-driven rather 
than designer-driven, i.e., they include user-perceptible Web evaluation in- 
dicators such as navigation or believability, rather than quantifiable Web 
attributes such as code quality or design; that is, we consider Web charac- 
teristics and attributes easily comprehensible by a general Web visitor. 

— Their measurement methods are user intuition-centered rather than
model-centered, i.e., the evaluations are obtained from judgements provided by
the Web visitors rather than from assessments obtained objectively by means of
the direct observation of the model characteristics.

Both quality evaluation models are designed using an ordinal fuzzy linguistic
approach [8, 9]. Visitors provide their evaluation judgements by means of
linguistic terms assessed on linguistic variables [31]. After examining a
document stored in a particular Web site, the users are invited to complete an
evaluation questionnaire about the quality of the accessed document or site.
The quality evaluation value of a Web document/site is obtained from the
combination of its visitors' linguistic evaluation judgements. This
combination is carried out by using the linguistic aggregation operators
LOWA [9] and LWA [8]. The quality evaluation values or recommendations
obtained are also of a linguistic nature, and describe qualitatively the
quality of the Web documents/sites. In this way, when a user requires
information, not only the retrieved documents or Web sites could be provided,
but also recommendations on their informative quality and on Web sites that
store similar documents that could be of interest to the user. The user could
use this as an aid in deciding which Web resources to access.

The rest of the paper is set out as follows. The ordinal fuzzy linguistic ap- 
proach is presented in Section 2. The fuzzy qualitative model to evaluate the 
quality of Web documents is defined in Section 3. The fuzzy qualitative model 
to evaluate the quality of Web sites is defined in Section 4. 

2 Ordinal Fuzzy Linguistic Approach 

The ordinal fuzzy linguistic approach [8, 9] is a very useful kind of fuzzy 
linguistic approach used for modelling the computing with words process as well 
as linguistic aspects of problems. It is defined by considering a finite and 
totally ordered label set $S = \{s_i\}$, $i \in \{0, \ldots, T\}$, in the usual 
sense, i.e., $s_i \geq s_j$ if $i \geq j$, and with odd cardinality (7 or 9 
labels). The mid term represents an assessment of "approximately 0.5", and the 
rest of the terms are placed symmetrically around it. The semantics of the label 
set is established from the ordered structure of the label set by considering 
that each label for the pair $(s_i, s_{T-i})$ is equally informative. 

In any linguistic approach we need management operators of linguistic 
information. An advantage of the ordinal fuzzy linguistic approach is the 
simplicity and quickness of its computational model. It is based on symbolic 
computation [8, 9] and acts by direct computation on labels by taking into 
account the order of such linguistic assessments in the ordered structure of 
labels. Usually, the ordinal fuzzy linguistic model for computing with words is 
defined by establishing i) a negation operator, ii) comparison operators based 
on the ordered structure of linguistic terms, and iii) adequate aggregation 
operators of ordinal fuzzy linguistic information. In most ordinal fuzzy 
linguistic approaches the negation operator is defined from the semantics 
associated to the linguistic terms as $Neg(s_i) = s_j$ with $j = T - i$; and two 
comparison operators of linguistic terms are defined: i) the maximization 
operator, $MAX(s_i, s_j) = s_i$ if $s_i \geq s_j$; and ii) the minimization 
operator, $MIN(s_i, s_j) = s_i$ if $s_i \leq s_j$. In the following subsections, 
we present two operators based on symbolic computation. 
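As a minimal sketch of the symbolic model just described, the label set can be represented by the indices 0, ..., T, so that negation and comparison become integer operations. The seven label names below are hypothetical and serve only as an illustration; the text only fixes an odd cardinality of 7 or 9.

```python
from typing import List

# Hypothetical 7-label term set (T = 6); the names are illustrative assumptions.
LABELS: List[str] = ["None", "VeryLow", "Low", "Medium", "High", "VeryHigh", "Total"]
T = len(LABELS) - 1


def neg(i: int) -> int:
    """Neg(s_i) = s_j with j = T - i, expressed on label indices."""
    return T - i


def label_max(i: int, j: int) -> int:
    """MAX(s_i, s_j): the larger label in the ordered structure."""
    return max(i, j)


def label_min(i: int, j: int) -> int:
    """MIN(s_i, s_j): the smaller label in the ordered structure."""
    return min(i, j)


low, high = LABELS.index("Low"), LABELS.index("High")
negated_low = LABELS[neg(low)]           # label symmetric to "Low" around the mid term
larger = LABELS[label_max(low, high)]
smaller = LABELS[label_min(low, high)]
```

Working on indices keeps every operation inside the label set, which is the point of the symbolic approach: no membership functions are needed.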

2.1 The LOWA Operator 

The Linguistic Ordered Weighted Averaging (LOWA) is an operator used to 
aggregate non-weighted ordinal linguistic information, i.e., linguistic information 
values with equal importance [9]. 

Definition 1. Let $A = \{a_1, \ldots, a_m\}$ be a set of labels to be aggregated, then 
the LOWA operator, $\phi$, is defined as 

$\phi(a_1, \ldots, a_m) = W \cdot B^T = C^m\{w_k, b_k, k = 1, \ldots, m\} = w_1 \odot b_1 \oplus (1 - w_1) \odot C^{m-1}\{\beta_h, b_h, h = 2, \ldots, m\}$, 

where $W = [w_1, \ldots, w_m]$ is a weighting vector, such that $w_i \in [0, 1]$ and 
$\sum_i w_i = 1$, $\beta_h = w_h / \sum_{k=2}^{m} w_k$, $h = 2, \ldots, m$, and 
$B = \{b_1, \ldots, b_m\}$ is a vector associated to A, such that 
$B = \sigma(A) = \{a_{\sigma(1)}, \ldots, a_{\sigma(m)}\}$, where 
$a_{\sigma(j)} \leq a_{\sigma(i)}$ $\forall i \leq j$, with $\sigma$ being a 
permutation over the set of labels A. $C^m$ is the convex combination operator 
of m labels and if m = 2, then it is defined as 
$C^2\{w_i, b_i, i = 1, 2\} = w_1 \odot s_j \oplus (1 - w_1) \odot s_i = s_k$, such 
that $k = \min\{T, i + round(w_1 \cdot (j - i))\}$, $s_j, s_i \in S$ $(j \geq i)$, 
where "round" is the usual round operation, and $b_1 = s_j$, $b_2 = s_i$. If 
$w_j = 1$ and $w_i = 0$ with $i \neq j$ $\forall i$, then the convex combination is 
defined as $C^m\{w_i, b_i, i = 1, \ldots, m\} = b_j$. 
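The recursive structure of Definition 1 can be sketched on label indices as follows. This is an illustrative reading of the definition, not the authors' implementation; the function names are assumptions, and "usual round" is taken to mean rounding halves up (Python's built-in `round` rounds halves to even, so it is not used).

```python
import math
from typing import List, Sequence


def round_half_up(x: float) -> int:
    """Usual rounding, assumed here to round halves up."""
    return math.floor(x + 0.5)


def convex_combination(weights: Sequence[float], labels: Sequence[int], T: int) -> int:
    """C^m on label indices; `labels` must be sorted in decreasing order."""
    if len(labels) == 1:
        return labels[0]
    w1 = weights[0]
    rest = sum(weights[1:])
    if rest == 0:
        # w_1 = 1 and all other weights are 0: the result is b_1.
        return labels[0]
    betas: List[float] = [w / rest for w in weights[1:]]  # beta_h = w_h / sum_{k>=2} w_k
    i = convex_combination(betas, labels[1:], T)          # combination of the smaller labels
    j = labels[0]                                         # largest label, hence j >= i
    return min(T, i + round_half_up(w1 * (j - i)))


def lowa(labels: Sequence[int], weights: Sequence[float], T: int) -> int:
    """LOWA operator: order the labels decreasingly, then combine them."""
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    return convex_combination(list(weights), sorted(labels, reverse=True), T)


# Four judgements on a 7-label scale (T = 6) with equal weights.
result = lowa([1, 3, 4, 6], [0.25, 0.25, 0.25, 0.25], T=6)
```

With W = [1, 0, ..., 0] the sketch returns the largest label and with W = [0, ..., 0, 1] the smallest, which matches the "or-and" behaviour attributed to the operator.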

The LOWA operator is an "or-and" operator [9] and its behavior can be 
controlled by means of W. In order to classify OWA operators in regard to 
their localisation between "or" and "and", Yager [30] introduced a measure of 
orness, associated with any vector W: 

$orness(W) = \frac{1}{m-1} \sum_{i=1}^{m} (m - i) w_i$. 

This measure characterizes the degree to which the aggregation is like an "or" 
(MAX) operation. Note that an OWA operator with $orness(W) \geq 0.5$ will be an 
orlike operator, and with $orness(W) < 0.5$ will be an andlike operator. 
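The orness measure is a one-line computation. The sketch below assumes a weighting vector of length m of at least 2 that sums to one; the function name is illustrative.

```python
from typing import Sequence


def orness(weights: Sequence[float]) -> float:
    """Yager's orness: (1 / (m - 1)) * sum_{i=1..m} (m - i) * w_i."""
    m = len(weights)
    if m < 2:
        raise ValueError("orness needs at least two weights")
    return sum((m - i) * w for i, w in enumerate(weights, start=1)) / (m - 1)


pure_or = orness([1.0, 0.0, 0.0])     # all weight on the largest label (MAX)
pure_and = orness([0.0, 0.0, 1.0])    # all weight on the smallest label (MIN)
neutral = orness([0.25, 0.25, 0.25, 0.25])
```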

An important question of the OWA operator is the determination of W. 
A good solution consists of representing the concept of fuzzy majority by means of 
the weights of W, using a non-decreasing proportional fuzzy linguistic quantifier 






[32] Q in its computation [30]: $w_i = Q(i/m) - Q((i-1)/m)$, $i = 1, \ldots, m$, 
where the membership function of Q is 

$Q(r) = \begin{cases} 0 & \text{if } r < a \\ \frac{r-a}{b-a} & \text{if } a \leq r \leq b \\ 1 & \text{if } r > b \end{cases}$ with $a, b, r \in [0, 1]$. 

When a fuzzy linguistic quantifier Q is used to compute the weights of the LOWA 
operator, it is symbolized by $\phi_Q$. 
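A short sketch of how a weighting vector can be derived from such a quantifier. The parameter pair (0.3, 0.8) used below is a hypothetical choice for illustration and is not taken from the text; a < b is assumed so that the middle branch is well defined.

```python
from typing import Callable, List


def linear_quantifier(a: float, b: float) -> Callable[[float], float]:
    """Non-decreasing proportional quantifier Q with parameters a < b in [0, 1]."""
    if not 0.0 <= a < b <= 1.0:
        raise ValueError("expected 0 <= a < b <= 1")

    def Q(r: float) -> float:
        if r < a:
            return 0.0
        if r > b:
            return 1.0
        return (r - a) / (b - a)

    return Q


def quantifier_weights(Q: Callable[[float], float], m: int) -> List[float]:
    """w_i = Q(i/m) - Q((i-1)/m), i = 1, ..., m."""
    return [Q(i / m) - Q((i - 1) / m) for i in range(1, m + 1)]


# Hypothetical parameters (0.3, 0.8), chosen only to show the shape of the weights.
weights_example = quantifier_weights(linear_quantifier(0.3, 0.8), 5)

# With (0, 0.5) all the weight falls on the first half of the ordered labels.
weights_first_half = quantifier_weights(linear_quantifier(0.0, 0.5), 4)
```

Because the differences telescope, the weights always sum to Q(1) - Q(0), which equals 1 whenever a > 0 or, as here, Q(0) = 0.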

2.2 The LWA Operator 

The Linguistic Weighted Averaging (LWA) operator is another important oper- 
ator which is defined to aggregate weighted ordinal linguistic information, i.e., 
linguistic information values with non equal importance [8]. 

Definition 2. The aggregation of a set of weighted linguistic opinions, 
$\{(c_1, a_1), \ldots, (c_m, a_m)\}$, $c_i, a_i \in S$, according to the LWA operator 
$\Phi$ is defined as 
$\Phi[(c_1, a_1), \ldots, (c_m, a_m)] = \phi(h(c_1, a_1), \ldots, h(c_m, a_m))$, where 
$a_i$ represents the weighted opinion, $c_i$ the importance degree of $a_i$, and h 
is the transformation function defined depending on the weighting vector W used 
for the LOWA operator $\phi$, such that $h = MIN(c_i, a_i)$ if $orness(W) \geq 0.5$ 
and $h = MAX(Neg(c_i), a_i)$ if $orness(W) < 0.5$. 
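Definition 2 can be sketched by composing the transformation h with a LOWA aggregation. The code below is an illustrative reading on label indices: the helper names are assumptions, rounding is assumed to round halves up, and the tie orness(W) = 0.5 is assigned to the MIN branch.

```python
import math
from typing import List, Sequence, Tuple


def orness(weights: Sequence[float]) -> float:
    m = len(weights)
    return sum((m - i) * w for i, w in enumerate(weights, start=1)) / (m - 1)


def lowa(labels: Sequence[int], weights: Sequence[float], T: int) -> int:
    """LOWA on label indices via the recursive convex combination."""
    def combine(w: List[float], b: List[int]) -> int:
        rest = sum(w[1:])
        if len(b) == 1 or rest == 0:
            return b[0]
        i = combine([x / rest for x in w[1:]], b[1:])
        return min(T, i + math.floor(w[0] * (b[0] - i) + 0.5))

    return combine(list(weights), sorted(labels, reverse=True))


def lwa(pairs: Sequence[Tuple[int, int]], weights: Sequence[float], T: int) -> int:
    """LWA: `pairs` holds (importance c_i, opinion a_i) as label indices."""
    if orness(weights) >= 0.5:
        transformed = [min(c, a) for c, a in pairs]        # h = MIN(c_i, a_i)
    else:
        transformed = [max(T - c, a) for c, a in pairs]    # h = MAX(Neg(c_i), a_i)
    return lowa(transformed, weights, T)


pairs = [(6, 5), (3, 5), (1, 6)]                 # (importance, opinion) on a 7-label scale
orlike = lwa(pairs, [0.5, 0.5, 0.0], T=6)        # orness = 0.75, MIN transformation
andlike = lwa(pairs, [0.0, 0.5, 0.5], T=6)       # orness = 0.25, MAX(Neg, .) transformation
```

In the orlike case a low importance caps the opinion from above; in the andlike case a low importance lifts the opinion so that it cannot pull the aggregate down.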

3 A Fuzzy Qualitative Model to Evaluate the Quality 
of Web Documents 

In this Section, we present a fuzzy qualitative model to evaluate the quality of 
Web documents in XML format with the aim of assigning them quality evaluation 
values or recommendations. It is defined from the user perception, and 
therefore, it is qualitative and subjective. It establishes two elements to achieve 
the quality evaluation values or recommendations: i) a user-driven evaluation 
scheme of Web documents which is associated with their respective DTDs, and 
ii) a user-centered generation method which is based on the LWA and LOWA 
operators. 

In the following Subsections, we analyze both elements. 

3.1 The User-Driven Evaluation Scheme for Web Documents 
in XML Format 

We propose a user-driven evaluation scheme to evaluate the informative quality of 
the Web documents, i.e., the user-driven evaluation scheme is based on relevance 
judgements provided by the users that access Web documents. Therefore, it is 
defined from the informative elements that compose the DTD of Web documents 
in XML format. 

Given a kind of XML based Web document, for example a "scientific article" 
with the DTD 

<!DOCTYPE article [ 
<!ELEMENT article (title, authors, abstract?, introduction, body, conclusions, bibliography)> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT authors (author+)> 
<!ELEMENT (author | abstract | introduction) (#PCDATA)> 
<!ELEMENT body (section+)> 
<!ELEMENT section (titleS, #PCDATA)> 
<!ELEMENT titleS (#PCDATA)> 
<!ELEMENT conclusions (#PCDATA)> 
<!ELEMENT bibliography (bibitem+)> 
<!ELEMENT bibitem (#PCDATA)> 
]> 

we can establish a user-driven evaluation scheme composed of a subset of the 
elements that define its DTD (e.g. "title, authors, abstract, introduction, body, 
conclusions, bibliography"). We assume that each component of that subset has 
a distinct informative role, i.e., each one affects the overall evaluation of a 
document in a different way. This peculiarity can be easily added in the DTD by 
defining an attribute for each meaningful component that contains a relative 
linguistic importance degree. Then, given an area of interest (e.g. "web 
publishing"), the quality evaluation value for an XML based document is obtained 
by combining the linguistic evaluation judgements provided by a non-determined 
number of Web visitors who accessed the Web documents and provided their 
opinions on the more important elements of the DTD associated with the Web 
documents. 

3.2 The User-Centered Generation Method for Web Documents 
in XML Format 

Suppose that we want to generate a recommendation database for qualifying 
the information of a set of XML Web documents $\{d_1, \ldots, d_l\}$ with the same 
DTD. These documents can be evaluated from a set of different areas of interest, 
$\{A_1, \ldots, A_M\}$. Consider an evaluation scheme composed of a finite number of 
elements of the DTD, $\{p_1, \ldots, p_n\}$, which will be evaluated in each document 
$d_k$ by a panel of Web visitors $\{e_1, \ldots, e_m\}$. We assume that each component 
of that evaluation scheme presents a distinct informative role. This is modelled 
by assigning to each $p_j$ a relative linguistic importance degree $I(p_j) \in S$. Each 
importance degree $I(p_j)$ is a measure of the relative importance of element $p_j$ 
with respect to the others existing in the evaluation scheme. We propose to include 
these relative linguistic importance degrees in the DTD. This can be done easily 
by defining in the DTD an attribute of importance "rank" for each component 
of the evaluation scheme. 

Let $e_{kj}^{lt} \in S$ be the linguistic evaluation judgement provided by the visitor 
$e_k$ measuring the informative quality or significance of element $p_j$ of document 
$d_l$ with respect to the area of interest $A_t$. Then, the evaluation procedure of an 
XML document $d_l$ obtains a recommendation $r^{lt} \in S$ using the LWA-LOWA 
based aggregation method in the following steps: 

1. Capture the topic of interest ($A_t$), the linguistic importance degrees of the 
evaluation scheme fixed in the DTD $\{I(p_1), \ldots, I(p_n)\}$, and all the evaluation 
judgements provided by the panel of visitors $\{e_{kj}^{lt}, j = 1, \ldots, n\}$, 
$k = 1, \ldots, m$. To do so, we associate with each XML document an evaluation 
questionnaire of relevance that depends on the kind of document. For example, 
if the XML document is the above "scientific article" with that DTD, 
then we can establish the relevance evaluation questionnaire on the following 
set of elements of the DTD: "title, authors, abstract, introduction, body, 
conclusions, bibliography". In this case, the relevance evaluation questionnaire 
would have 7 questions, and for example, a question could be "What is 
the relevance degree of the title with respect to the search topic?". In other 
kinds of XML documents we have to choose the set of elements of the DTD, 
$\{p_1, \ldots, p_n\}$, to be considered in the relevance evaluation questionnaire. 

2. Calculate for each visitor $e_k$ his/her individual recommendation $r_k^{lt}$ by 
means of the LWA operator as 

$r_k^{lt} = \Phi[(I(p_1), e_{k1}^{lt}), \ldots, (I(p_n), e_{kn}^{lt})] = \phi_{Q_2}(h(I(p_1), e_{k1}^{lt}), \ldots, h(I(p_n), e_{kn}^{lt}))$. 

Therefore, $r_k^{lt}$ is a significance measure that represents the informative 
quality of $d_l$ with respect to topic $A_t$ according to the $Q_2$ evaluation judgements 
provided by $e_k$. 

3. Calculate the global recommendation $r^{lt}$ by means of an LOWA operator 
guided by the fuzzy majority concept represented by a linguistic quantifier 
$Q_1$ as 

$r^{lt} = \phi_{Q_1}(r_1^{lt}, \ldots, r_m^{lt})$. 

In this case, $r^{lt}$ is a significance measure that represents the informative 
quality of $d_l$ with respect to topic $A_t$ according to the $Q_2$ evaluation judgements 
provided by the $Q_1$ recommenders. $r^{lt}$ represents the linguistic informative 
category of $d_l$ with respect to the topic $A_t$. 

4. Store the recommendation $r^{lt}$ in a repository in order to assist users in their 
later search processes. 

In the evaluation procedure the linguistic quantifiers $Q_1$ and $Q_2$ represent the 
concept of fuzzy majority in the computing process with words. In such a way, 
the recommendations on documents are obtained by taking into account the 
majority of evaluations provided by the majority of recommenders. 
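The two aggregation stages of this procedure (LWA per visitor, then LOWA across visitors) can be sketched as below. Everything here is an illustrative assumption: the function names, the representation of labels by indices on a 7-label scale, the quantifier parameters, rounding halves up, and the sample judgements.

```python
import math
from typing import List, Sequence, Tuple


def quantifier_weights(a: float, b: float, m: int) -> List[float]:
    """Weights w_i = Q(i/m) - Q((i-1)/m) for a linear quantifier with a < b."""
    def Q(r: float) -> float:
        return 0.0 if r < a else 1.0 if r > b else (r - a) / (b - a)
    return [Q(i / m) - Q((i - 1) / m) for i in range(1, m + 1)]


def orness(w: Sequence[float]) -> float:
    m = len(w)
    return sum((m - i) * x for i, x in enumerate(w, start=1)) / (m - 1)


def lowa(labels: Sequence[int], weights: Sequence[float], T: int) -> int:
    def combine(w: List[float], b: List[int]) -> int:
        rest = sum(w[1:])
        if len(b) == 1 or rest == 0:
            return b[0]
        i = combine([x / rest for x in w[1:]], b[1:])
        return min(T, i + math.floor(w[0] * (b[0] - i) + 0.5))
    return combine(list(weights), sorted(labels, reverse=True))


def lwa(importances: Sequence[int], opinions: Sequence[int],
        weights: Sequence[float], T: int) -> int:
    if orness(weights) >= 0.5:
        h = [min(c, a) for c, a in zip(importances, opinions)]
    else:
        h = [max(T - c, a) for c, a in zip(importances, opinions)]
    return lowa(h, weights, T)


def recommend(importances: Sequence[int], panel: Sequence[Sequence[int]],
              q1: Tuple[float, float], q2: Tuple[float, float], T: int) -> Tuple[List[int], int]:
    """Step 2 (individual recommendations via LWA) and step 3 (global via LOWA)."""
    w2 = quantifier_weights(q2[0], q2[1], len(importances))
    individual = [lwa(importances, row, w2, T) for row in panel]
    w1 = quantifier_weights(q1[0], q1[1], len(panel))
    return individual, lowa(individual, w1, T)


# Hypothetical data: 4 DTD elements with importance degrees, 2 visitors, T = 6.
importances = [6, 4, 4, 2]
panel = [[6, 3, 6, 4], [2, 4, 2, 6]]
individual, global_rec = recommend(importances, panel, q1=(0.0, 0.5), q2=(0.0, 0.5), T=6)
```

Step 4 (storing the recommendation) would then amount to saving `global_rec` under the key (document, topic); the storage layer is outside the scope of this sketch.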

4 A Fuzzy Qualitative Model to Evaluate the Quality 
of Web Sites 

An information quality framework was proposed in [16, 19, 25, 29] by considering 
that the quality of information systems cannot be assessed independently 
of the opinions of the information consumers (the people who use the information). 
This framework establishes four major information quality categories to classify 
the different evaluation dimensions [16, 19, 25, 29]: 

1. Intrinsic information quality, which emphasizes the importance of the 
informative aspects of the information itself. Some dimensions of this category 
are: accuracy of the information, believability, reputation and objectivity. 

2. Contextual information quality, which also emphasizes the importance of the 
informative aspects of the information but from a task perspective. Some 
dimensions of this category are: value-added, relevance, completeness, 
timeliness, appropriate amount. 

3. Representational information quality, which emphasizes the importance of 
the technical aspects of the computer system that stores the information. 
Some of its dimensions are: understandability, interpretability, concise rep- 
resentation, consistent representation. 

4. Accessibility information quality, which emphasizes the importance of the 
technical aspects of the computer system that provides access to information. 
Some dimensions of this category are: accessibility and secure access. 

Using this information quality framework we develop a fuzzy qualitative 
model to evaluate the quality of the Web sites that provide information stored 
in XML documents. It is defined from the information consumers' perspective, 
and therefore, it is also qualitative and subjective. It is composed of two elements: 

1. A user-driven evaluation scheme, that contains dimensions easily compre- 
hensible to the information consumers (e.g. relevance, understandability) 
rather than dimensions that can be objectively measured independently of 
the consumers (e.g. accuracy measured by the number of spelling or gram- 
matical errors). 

2. A user-centered generation method, that generates linguistic recommenda- 
tions on Web sites from the evaluations provided by different visitors to Web 
sites. 

Both elements are presented in the following Subsections. 



4.1 The User-Driven Evaluation Scheme for Web Sites 

We analyze Web sites that store information in multiple kinds of documents 
structured in the XML format (e.g. scientific articles, opinion articles) when 
users visit them occasionally because they store documents which meet their 
information needs. Therefore, user opinions on the informative quality of these 
documents (e.g. the relevance) must be an important dimension in the evaluation 
scheme. Taking into account these considerations, we define an evaluation scheme 
of Web sites oriented to the user that contemplates four quality categories with 
the following evaluation dimensions: 

1. Intrinsic quality of Web sites. Accuracy of information is the main determi- 
nant of the intrinsic information quality of information systems. We discuss 
accuracy of Web sites by considering what visitors think about the believ- 
ability of the information content that the Web site provides. Given that we 
consider Web sites as information sources that are visited occasionally, we 
are not interested in evaluating the accuracy by means of grammatical and 
spelling errors or relevant hyper-links existing on the Web site. 







Table 1. User-driven evaluation scheme of Web sites 

INFORMATION QUALITY CATEGORIES          EVALUATION DIMENSIONS 
Intrinsic quality of Web sites          believability 
Contextual quality of Web sites         relevancy, timeliness, completeness 
Representational quality of Web sites   understandability of Web sites, originality, 
                                        understandability of documents, conciseness 
Accessibility quality of Web sites      navigational tools 



2. Contextual quality of Web sites. This is the most important category in the 
evaluation scheme. We propose to evaluate this category by considering what 
visitors think about the relevancy, timeliness and completeness of documents 
that the Web site provides them with when they search for information 
about a particular topic, i.e., if documents are relevant to the search topic, if 
documents are sufficiently current and up-to-date with regard to the search 
topic, and if documents are sufficiently complete with regard to the topic. 

3. Representational quality of Web sites. We analyze this category for the Web 
sites that provide information stored in XML documents from two aspects: i) 
representational aspects of Web site design and ii) representational aspects 
of documents stored in the Web site. In the first case, we consider what 
visitors think about the understandability of the Web site, i.e., whether or 
not the Web site is well organized in such a way that visitors can easily 
understand how to access stored documents. In the second one, we consider 
what visitors think about the understandability, originality and conciseness 
of the information content of XML documents used. 

4. Accessibility quality of Web sites. We consider that this category must be 
assessed as to whether or not the Web site provides enough navigation 
mechanisms so that visitors can reach their desired documents more quickly and 
easily. Lacking effective paths to access the desired documents would handicap 
visitors; therefore navigation tools are necessary to help users locate the 
information they require. We evaluate this category by considering what visitors 
think about the navigational tools of the Web site. The security dimension 
is not a key aspect of the Web sites that we are considering. 

The evaluation scheme is summarized in Table 1. 



4.2 The User-Centered Generation Method for Web Sites 

In this Subsection, we present a generation method of linguistic 
recommendations for evaluating the informative quality of Web sites. These 
linguistic recommendations are obtained from the linguistic evaluation judgements 
provided by a non-determined number of Web visitors. After a visitor has used an 
XML document stored in a Web site, he/she is invited to complete a quality 
evaluation questionnaire as per the quality dimensions established in the above 
evaluation scheme. The recommendations are obtained by aggregating the linguistic 
evaluation judgements by means of the LWA and LOWA operators. 

The quality evaluation questionnaire provides questions for each one of the 
dimensions proposed in the evaluation scheme, i.e., there are nine questions: 
$\{q_1, \ldots, q_9\}$. For example, for the quality dimension believability the 
question $q_1$ can be: "What is the degree of believability of this Web site in your 
opinion?". The concept behind each question is rated on a linguistic term set S. 
We should point out that the question $q_2$ = relevancy is not evaluated directly by 
means of a particular value supplied by a user. This dimension is evaluated by 
applying the fuzzy qualitative model to evaluate the quality of Web documents 
presented in Section 3. Furthermore, we assume that each quality dimension does 
not have the same importance in the evaluation scheme, i.e., a relative linguistic 
importance degree is assigned to each quality dimension: 
$\{I(q_1), \ldots, I(q_9)\}$, $I(q_i) \in S$. To assign these degrees, the quality 
dimensions related to the Web site content itself (those included in the first and 
second category of the evaluation scheme) should have more importance than the 
remaining ones. In particular, the relevancy has the greatest degree of relative 
importance. 

Summarizing, the quality evaluation questionnaire that a visitor must com- 
plete is comprised of 8 questions, given that the relevance is associated with the 
accessed Web document assessed according to the evaluation model presented 
in Section 3. 

Suppose that we want to generate a recommendation database for qualifying 
the informative quality of a set of Web sites $\{Web_1, \ldots, Web_L\}$ which store 
information in XML documents. These Web sites can be evaluated from a set of 
different areas of interest or search topics, $\{A_1, \ldots, A_M\}$. Suppose that 
$D_l$ represents the set of XML documents stored in the Web site $Web_l$. We consider 
that each XML document $d_j \in D_l$ presents an evaluation scheme composed of a 
finite set of elements of its DTD, $\{p_1, \ldots, p_n\}$, and its respective relative 
linguistic importance degrees $\{I(p_1), \ldots, I(p_n)\}$. Let 
$\{e_1^{m,l}, \ldots, e_K^{m,l}\}$ be the set of different visitors to the Web site 
$Web_l$ who completed the quality evaluation questionnaire $\{q_1, \ldots, q_9\}$ 
when they searched for information about the topic $A_m$. In the quality evaluation 
scheme each question $q_i$ is associated to its respective linguistic importance 
degree $I(q_i)$. Let $\{q_1^k, \ldots, q_9^k\}$ be a set of linguistic assessments 
provided by the visitor $e_k^{m,l}$. We must point out that the assessment $q_2^k$ is 
achieved from the set of linguistic evaluation judgements 
$\{e_{k1}^{jm}, \ldots, e_{kn}^{jm}\}$ provided by the visitor $e_k^{m,l}$ regarding 
the set of elements of the DTD, $\{p_1, \ldots, p_n\}$, associated to the accessed XML 
document $d_j$. Then, $q_2^k$ is obtained using the LWA operator as follows: 

$q_2^k = \Phi[(I(p_1), e_{k1}^{jm}), \ldots, (I(p_n), e_{kn}^{jm})] = \phi_{Q_3}(h(I(p_1), e_{k1}^{jm}), \ldots, h(I(p_n), e_{kn}^{jm}))$, 

$Q_3$ being the linguistic quantifier used to calculate the weighting vector W. If we 
assume that $Q_3$ represents the concept of fuzzy majority then $q_2^k$ is a measure 
of significance that represents the relevance of $d_j$ with respect to the topic $A_m$ 
according to the $Q_3$ linguistic evaluation judgements provided by $e_k^{m,l}$ on the 
meaningful elements of the DTD associated with $d_j$. Then, given a search topic 
$A_m$, the generation process of a linguistic recommendation $r^{m,l} \in S$ for a 
Web site $Web_l$ is obtained using a LWA-LOWA based aggregation method in the 
following steps: 

1. Calculate for $e_k^{m,l}$ his/her individual recommendation $r_k^{m,l}$ by means 
of the LWA operator $\Phi$: 

$r_k^{m,l} = \Phi[(I(q_1), q_1^k), \ldots, (I(q_9), q_9^k)] = \phi_{Q_2}(h(I(q_1), q_1^k), \ldots, h(I(q_9), q_9^k))$. 

$r_k^{m,l}$ is a measure that represents the informative quality of the Web site 
$Web_l$ with respect to topic $A_m$ according to the $Q_2$ linguistic evaluation 
judgements provided by the visitor $e_k^{m,l}$. 

2. Calculate the global recommendation $r^{m,l}$ by means of an LOWA operator 
guided by the fuzzy majority concept represented by a linguistic quantifier 
$Q_1$ as $r^{m,l} = \phi_{Q_1}(r_1^{m,l}, \ldots, r_K^{m,l})$. In this case, $r^{m,l}$ is 
a measure that represents the informative quality of the Web site $Web_l$ with 
respect to topic $A_m$ according to the $Q_2$ evaluation judgements provided by the 
$Q_1$ visitors or recommenders. $r^{m,l}$ represents the linguistic informative 
category of $Web_l$ with respect to the topic $A_m$. 

3. Store the recommendation $r^{m,l}$ in order to assist users in future search 
processes. 
Abstract. Multisets are now a common tool and a fundamental framework 
in information processing. Their generalization to fuzzy multisets 
has also been studied. In this paper the basics of multisets and fuzzy 
multisets are reviewed, fundamental properties of fuzzy multisets are proved, 
and advanced operations are defined. Applications to rough sets, fuzzy 
data retrieval, and automatic classification are moreover considered. 



1 Introduction 

Recently many studies discuss multisets and their applications such as databases, 
information retrieval, and new computing paradigm [-5]. 

Multisets have sometimes been called bags. Indeed, while the well-known 
book by Knuth [9] uses the term of multisets, another book by Manna and 
Waldinger [11] devotes a chapter to bags. The terms of multiset and bag can thus 
be used interchangeably. Although the author prefers to use the term multisets, 
some readers may interpret them to be bags instead. 

This paper discusses a generalization of multisets, that is, fuzzy multisets [28, 
7, 24, 25, 8, 14, 15, 16, 18, 23, 10]. Hence the ordinary nonfuzzy multisets are 
sometimes called crisp multisets by the common usage in fuzzy systems theory. 

A characteristic of fuzzy multisets is that definitions of elementary operations 
require a nontrivial data handling of sorting membership sequences. We will 
see why such sorting is essential by introducing the $\alpha$-cut for fuzzy multisets. 
Another cut operation for a multiset is called here the $\nu$-cut, which corresponds 
to the $\alpha$-cut for fuzzy sets. The $\nu$-cut is generalized to fuzzy multisets and 
commutative properties between these cuts and elementary operations are proved. 

Both theoretical and real-world applications of fuzzy multisets are consid- 
ered. As a theoretical application, rough approximations [21] of fuzzy multisets 
are discussed. Moreover fuzzy database systems and information retrieval are 
mentioned. Lastly, methods of data classification and clustering are briefly dis- 
cussed. 

* This research has partially been supported by the Grant-in-Aid for Scientific Re- 
search, the Ministry of Education, Sports, Culture, Science and Technology, Japan, 
No. 16650044. 
2 Multisets and Fuzzy Multisets 

Before considering fuzzy multisets, a brief review of crisp multisets [9, 11, 2] is 
given. We assume finite sets and multisets for simplicity. 

2.1 Crisp Multisets 

Let us begin with a simple example. 

Example 1. Assume X = {x, y, z, w} is a finite set of symbols. Suppose we 
have a number of objects but they are not distinguishable except by their labels 
x, y, or z. For example, we have two balls with the label x and one ball with y, 
three with z, but no ball with the label w. Moreover we are not allowed to 
put additional labels to distinguish two x's. Therefore a natural representation 
of the situation is that we have a collection {x, x, y, z, z, z}. We can also write 
{2/x, 1/y, 3/z, 0/w} to show the number for each element of the universe X, or 
{2/x, 1/y, 3/z} by ignoring the zero of w. We will say that there are two 
occurrences of x, one occurrence of y, and so on. 

We proceed to general definitions. Assume $X = \{x_1, \ldots, x_n\}$ is a finite 
universe, or basis set. A crisp multiset M of X is characterized by the 
function $Count_M(\cdot)$ whereby a natural number (including zero) corresponds to 
each $x \in X$: $Count_M : X \to \{0, 1, 2, \ldots\}$ (cf. [2, 9, 11]). 

For a crisp multiset, different expressions such as 

$M = \{k_1/x_1, \ldots, k_n/x_n\}$ 

and 

$M = \{x_1, \ldots, x_1, \ldots, x_n, \ldots, x_n\}$ 

are used. An element of X may thus appear more than once in a multiset. In the 
above example $x_1$ appears $k_1$ times in M, hence we have $k_1$ occurrences of $x_1$. 
The collection of all crisp multisets of X is denoted by $C(X)$ here. 

Let us consider the first example: {2/x, 1/y, 3/z}. We have 

$Count_M(x) = 2$, $Count_M(y) = 1$, $Count_M(z) = 3$, $Count_M(w) = 0$. 

The following are basic relations and operations for crisp multisets. 

1. (inclusion): $M \subseteq N \Leftrightarrow Count_M(x) \leq Count_N(x)$, $\forall x \in X$. 

2. (equality): $M = N \Leftrightarrow Count_M(x) = Count_N(x)$, $\forall x \in X$. 

3. (union): $Count_{M \cup N}(x) = \max\{Count_M(x), Count_N(x)\}$. 

4. (intersection): $Count_{M \cap N}(x) = \min\{Count_M(x), Count_N(x)\}$. 

5. (addition): $Count_{M \oplus N}(x) = Count_M(x) + Count_N(x)$. 

Readers should note that the operations resemble those for fuzzy sets, but 
the upper bound for $Count(\cdot)$ is not assumed. 
Since the upper bound is not specified, the complement of a multiset is difficult 
to study. Instead, an operation of nonstandard difference $M \sim N$ is defined 
as follows. 

$Count_{M \sim N}(x) = \begin{cases} Count_M(x), & (Count_N(x) = 0), \\ 0, & (Count_N(x) > 0). \end{cases}$ 

Example 2. Consider the multiset M in Example 1 and N = {1/x, 4/y, 3/w}. 
Then, 

$M \oplus N$ = {3/x, 5/y, 3/z, 3/w}, 
$M \cup N$ = {2/x, 4/y, 3/z, 3/w}, 
$M \cap N$ = {1/x, 1/y}, 
$M \sim N$ = {3/z}. 
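The crisp operations map directly onto Python's `collections.Counter`, whose `|`, `&` and `+` operators take the maximum, minimum and sum of counts. The sketch below reproduces Example 2; only the nonstandard difference needs a helper, whose name is an assumption.

```python
from collections import Counter


def nonstandard_difference(M: Counter, N: Counter) -> Counter:
    """Keep Count_M(x) where Count_N(x) = 0, and drop x otherwise."""
    return Counter({x: c for x, c in M.items() if N[x] == 0})


M = Counter({"x": 2, "y": 1, "z": 3})   # the multiset of Example 1
N = Counter({"x": 1, "y": 4, "w": 3})

addition = M + N          # Count_M(x) + Count_N(x)
union = M | N             # max of the counts
intersection = M & N      # min of the counts
difference = nonstandard_difference(M, N)

# Inclusion: Count_M(x) <= Count_N(x) for every x.
included = all(M[x] <= N[x] for x in M)
```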



2.2 Fuzzy Multisets 

Yager [28] first discussed fuzzy multisets, using the term fuzzy bag: 
an element of X may occur more than once with possibly the same or different 
membership values. 

Example 3. Consider a fuzzy multiset 

A = {(x, 0.2), (x, 0.3), (y, 1), (y, 0.5), (y, 0.5)} 

of X = {x, y, z, w}, which means that we have x with the membership 0.2, x 
with 0.3, y with the membership 1, and two y's with 0.5 in A. We may write 

A = {{0.2, 0.3}/x, {1, 0.5, 0.5}/y} 

in which the multisets of membership {0.2, 0.3} and {1, 0.5, 0.5} correspond to x 
and y, respectively. 

$Count_A(x)$ is thus a finite multiset of the unit interval [28]. The collection 
of all fuzzy multisets is denoted by $\mathcal{FM}(X)$, while the family of all 
(ordinary) fuzzy sets is denoted $\mathcal{F}(X)$. 

For $x \in X$, a membership sequence is defined to be the decreasingly ordered 
sequence of the elements in $Count_A(x)$. It is denoted by 

$(\mu_A^1(x), \mu_A^2(x), \ldots, \mu_A^p(x))$, 

where $\mu_A^1(x) \geq \mu_A^2(x) \geq \cdots \geq \mu_A^p(x)$. Hence we can write 

$A = \{(\mu_A^1(x), \ldots, \mu_A^p(x))/x\}_{x \in X}$    (1) 

In order to define an operation between two fuzzy multisets A and B, the 
lengths of the membership sequences $\mu_A^1(x), \mu_A^2(x), \ldots, \mu_A^p(x)$ and 
$\mu_B^1(x), \mu_B^2(x), \ldots, \mu_B^{p'}(x)$ should be set to be equal. We therefore 
append an appropriate number of zeros for this purpose. The resulting length for 
A and B is denoted by $L(x; A, B) = \max\{p, p'\}$: it depends on each $x \in X$. We 
sometimes write $L(x)$ instead of $L(x; A, B)$ when no ambiguity arises. 

If we define the length $L(x; A)$ by 

$L(x; A) = \max\{j : \mu_A^j(x) \neq 0\}$, 

we have $L(x; A, B) = \max\{L(x; A), L(x; B)\}$. 

Example 4. Let 

A = {{0.2, 0.3}/x, {1, 0.5, 0.5}/y}, 

B = {{0.6}/x, {0.8, 0.6}/y, {0.1, 0.7}/w}. 

For the representation of the membership sequence, we put 

L(x) = 2, L(y) = 3, L(z) = 0, L(w) = 2 

and we have 

A = {(0.3, 0.2)/x, (1, 0.5, 0.5)/y, (0, 0)/w}, 

B = {(0.6, 0)/x, (0.8, 0.6, 0)/y, (0.7, 0.1)/w}. 
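The sorting and zero-padding used in Example 4 can be sketched as follows. The dictionary representation (element to list of membership values) and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

FuzzyMultiset = Dict[str, List[float]]


def length(A: FuzzyMultiset, x: str) -> int:
    """L(x; A): number of nonzero memberships of x in A."""
    return sum(1 for m in A.get(x, []) if m != 0)


def membership_sequence(A: FuzzyMultiset, x: str, n: int) -> Tuple[float, ...]:
    """Decreasingly ordered memberships of x, padded with zeros to length n."""
    seq = sorted((m for m in A.get(x, []) if m != 0), reverse=True)
    return tuple(seq + [0.0] * (n - len(seq)))


def aligned(A: FuzzyMultiset, B: FuzzyMultiset, universe: Iterable[str]):
    """For each x, the pair of sequences of common length L(x; A, B)."""
    out = {}
    for x in universe:
        n = max(length(A, x), length(B, x))
        out[x] = (membership_sequence(A, x, n), membership_sequence(B, x, n))
    return out


A = {"x": [0.2, 0.3], "y": [1.0, 0.5, 0.5]}
B = {"x": [0.6], "y": [0.8, 0.6], "w": [0.1, 0.7]}
sequences = aligned(A, B, ["x", "y", "z", "w"])
```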

Basic relations and operations for fuzzy multisets are as follows [14]. 

1. [Inclusion] 

$A \subseteq B \Leftrightarrow \mu_A^j(x) \leq \mu_B^j(x)$, $j = 1, \ldots, L(x)$, $\forall x \in X$. 

2. [Equality] 

$A = B \Leftrightarrow \mu_A^j(x) = \mu_B^j(x)$, $j = 1, \ldots, L(x)$, $\forall x \in X$. 

3. [Addition] $A \oplus B$ is defined by the addition operation in $X \times [0, 1]$ for 
crisp multisets [28]: if $A = \{(x_i, \mu_i), \ldots, (x_k, \mu_k)\}$ and 
$B = \{(x_p, \mu_p), \ldots, (x_r, \mu_r)\}$ are two fuzzy multisets, 

$A \oplus B = \{(x_i, \mu_i), \ldots, (x_k, \mu_k), (x_p, \mu_p), \ldots, (x_r, \mu_r)\}$. 

4. [Union] 

$\mu_{A \cup B}^j(x) = \mu_A^j(x) \vee \mu_B^j(x)$, $j = 1, \ldots, L(x)$. 

5. [Intersection] 

$\mu_{A \cap B}^j(x) = \mu_A^j(x) \wedge \mu_B^j(x)$, $j = 1, \ldots, L(x)$. 

6. [t-norm and conorm] Let the t-norm and conorm operations for two fuzzy 
sets F, G be $F\,T\,G$ and $F\,S\,G$, respectively; they are given by 

$\mu_{F T G}(x) = t(\mu_F(x), \mu_G(x))$, $\mu_{F S G}(x) = s(\mu_F(x), \mu_G(x))$. 

A well-known operation is the algebraic product $T_A$ for which $t(a, b) = ab$ 
and hence $\mu_{F T_A G}(x) = \mu_F(x)\mu_G(x)$. Their extensions to fuzzy multisets 
are straightforward: 

$\mu_{A T B}^j(x) = t(\mu_A^j(x), \mu_B^j(x))$, $j = 1, \ldots, L(x)$, 

$\mu_{A S B}^j(x) = s(\mu_A^j(x), \mu_B^j(x))$, $j = 1, \ldots, L(x)$. 
7. [$\alpha$-cut] The $\alpha$-cut ($\alpha \in (0, 1]$) for a fuzzy multiset A, denoted 
by $[A]_\alpha$, is defined as follows. 

$\mu_A^1(x) < \alpha \Rightarrow Count_{[A]_\alpha}(x) = 0$, 

$\mu_A^j(x) \geq \alpha,\ \mu_A^{j+1}(x) < \alpha \Rightarrow Count_{[A]_\alpha}(x) = j$, 
$j = 1, \ldots, L(x)$. 

Moreover the strong $\alpha$-cut ($\alpha \in [0, 1)$), denoted $]A[_\alpha$, is defined 
as follows. 

$\mu_A^1(x) \leq \alpha \Rightarrow Count_{]A[_\alpha}(x) = 0$, 

$\mu_A^j(x) > \alpha,\ \mu_A^{j+1}(x) \leq \alpha \Rightarrow Count_{]A[_\alpha}(x) = j$, 
$j = 1, \ldots, L(x)$. 

8. [Cartesian product] Given two fuzzy multisets $A = \{(x, \mu)\}$ and 
$B = \{(y, \nu)\}$, the Cartesian product is defined: 

$A \times B = \{(x, y, \mu \wedge \nu)\}$ 

The combination is taken for all $(x, \mu)$ in A and $(y, \nu)$ in B. 

9. [Difference] The nonstandard difference $A \sim B$ is defined as follows. 

$\mu_{A \sim B}^j(x) = \begin{cases} \mu_A^j(x), & (\mu_B^1(x) = 0) \\ 0, & (\mu_B^1(x) > 0) \end{cases}$ 

where $j = 1, \ldots, L(x)$. 

10. [Multirelation] Notice that a crisp relation R on X is a subset of $X \times X$. 
Given a fuzzy multiset A of X, a multirelation $\mathcal{R}$ obtained from R is a subset 
of $A \times A$: for all $(x, \mu), (y, \nu) \in A$, 

$(x, y, \mu \wedge \nu) \in \mathcal{R} \Leftrightarrow (x, y) \in R$. 

When R is a fuzzy relation on X, then 

$(x, y, \mu \wedge \nu \wedge R(x, y)) \in \mathcal{R}$.    (2) 

(The latter includes the former as a special case.) 
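Several of the operations above can be sketched on sorted, zero-padded membership sequences. The dictionary representation, the function names, and the convention that no zero memberships are stored are assumptions of this sketch; the data are those of Example 4.

```python
from collections import Counter
from typing import Callable, Dict, List

FuzzyMultiset = Dict[str, List[float]]


def _padded(A: FuzzyMultiset, x: str, n: int) -> List[float]:
    seq = sorted(A.get(x, []), reverse=True)
    return seq + [0.0] * (n - len(seq))


def _pointwise(A: FuzzyMultiset, B: FuzzyMultiset,
               op: Callable[[float, float], float]) -> FuzzyMultiset:
    """Apply op to the j-th memberships of A and B for every x and j."""
    out: FuzzyMultiset = {}
    for x in set(A) | set(B):
        n = max(len(A.get(x, [])), len(B.get(x, [])))
        values = [op(a, b) for a, b in zip(_padded(A, x, n), _padded(B, x, n))]
        values = [v for v in values if v > 0]   # zeros are padding only
        if values:
            out[x] = values
    return out


def union(A: FuzzyMultiset, B: FuzzyMultiset) -> FuzzyMultiset:
    return _pointwise(A, B, max)


def intersection(A: FuzzyMultiset, B: FuzzyMultiset) -> FuzzyMultiset:
    return _pointwise(A, B, min)


def addition(A: FuzzyMultiset, B: FuzzyMultiset) -> FuzzyMultiset:
    return {x: sorted(A.get(x, []) + B.get(x, []), reverse=True) for x in set(A) | set(B)}


def alpha_cut(A: FuzzyMultiset, alpha: float) -> Counter:
    """Crisp multiset [A]_alpha: count the memberships of x that are >= alpha."""
    counts = {x: sum(1 for m in ms if m >= alpha) for x, ms in A.items()}
    return Counter({x: c for x, c in counts.items() if c > 0})


A = {"x": [0.2, 0.3], "y": [1.0, 0.5, 0.5]}
B = {"x": [0.6], "y": [0.8, 0.6], "w": [0.1, 0.7]}
```

Counting the memberships that reach the threshold gives the same j as the definition because each sequence is ordered decreasingly. On this data the cut also commutes with union and addition at alpha = 0.5, which is one instance of the commutativity stated in the propositions that follow.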



The following propositions are valid. The proofs are immediate and therefore 
omitted here. 

Proposition 1. Assume $A$ and $B$ are fuzzy multisets of $X$. The necessary and
sufficient condition for $A \subseteq B$ is that for all $\alpha \in (0, 1]$, $[A]_\alpha \subseteq [B]_\alpha$. Moreover,
the condition for $A = B$ is that for all $\alpha \in (0, 1]$, $[A]_\alpha = [B]_\alpha$.

When the strong cut is used, we have the same results.

Proposition 2. Assume $A$ and $B$ are fuzzy multisets of $X$. The necessary and
sufficient condition for $A \subseteq B$ is that for all $\alpha \in [0, 1)$, $]A[_\alpha \subseteq\, ]B[_\alpha$. Moreover,
the condition for $A = B$ is that for all $\alpha \in [0, 1)$, $]A[_\alpha =\, ]B[_\alpha$.

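Proposition 3. Assume $A$ and $B$ are fuzzy multisets of $X$. Take an arbitrary
$\alpha \in (0, 1]$. We then have

$[A \cup B]_\alpha = [A]_\alpha \cup [B]_\alpha, \quad [A \cap B]_\alpha = [A]_\alpha \cap [B]_\alpha,$

$[A \oplus B]_\alpha = [A]_\alpha \oplus [B]_\alpha, \quad [A \times B]_\alpha = [A]_\alpha \times [B]_\alpha,$

$]A \cup B[_\alpha =\, ]A[_\alpha \cup\, ]B[_\alpha, \quad ]A \cap B[_\alpha =\, ]A[_\alpha \cap\, ]B[_\alpha,$

$]A \oplus B[_\alpha =\, ]A[_\alpha \oplus\, ]B[_\alpha, \quad ]A \times B[_\alpha =\, ]A[_\alpha \times\, ]B[_\alpha.$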

Proposition 4. Assume $A$, $B$, and $C$ are fuzzy multisets of $X$. The following
are valid.

$A \cup B = B \cup A,$

$A \cap B = B \cap A,$

$A \cup (B \cup C) = (A \cup B) \cup C,$

$A \cap (B \cap C) = (A \cap B) \cap C,$

$(A \cap B) \cup C = (A \cup C) \cap (B \cup C),$
$(A \cup B) \cap C = (A \cap C) \cup (B \cap C).$

The class of all fuzzy multisets of a particular universe thus forms a distributive
lattice.



The Number of Elements. The number of elements, or cardinality, of a fuzzy
multiset $A$ is given by

$|A| = \sum_{x \in X} \sum_{j=1}^{L(x;A)} \mu^j_A(x).$

Moreover we define

$|A|_x = \sum_{j=1}^{L(x;A)} \mu^j_A(x).$

We thus have $|A| = \sum_{x \in X} |A|_x$.

We moreover introduce an $\ell_2$-norm for $A$ for later use:

$\|A\|^2 = \sum_{x \in X} \sum_{j=1}^{L(x;A)} \{\mu^j_A(x)\}^2.$

It is easily seen that

$\|A\| = \sqrt{|A\,T_a\,A|}.$

We easily have

$|A\,T_a\,B| \le \|A\|\,\|B\|$ (3)

using the Schwarz inequality.
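The cardinality, the $\ell_2$-norm and inequality (3) can be checked numerically. The sketch below again assumes the illustrative dict representation (element mapped to a decreasing membership list); the helper names are hypothetical.

```python
import math

def cardinality(a):
    """|A|: sum of all memberships over all elements."""
    return sum(sum(seq) for seq in a.values())

def l2_norm(a):
    """||A||: square root of the sum of squared memberships."""
    return math.sqrt(sum(m * m for seq in a.values() for m in seq))

def algebraic_product(a, b):
    """A T_a B: product of the j-th memberships; missing entries count as 0."""
    out = {}
    for x in set(a) & set(b):
        seq = [m * n for m, n in zip(a[x], b[x])]  # zip stops at the shorter sequence
        if seq:
            out[x] = seq
    return out

A = {"a": [0.8, 0.3], "b": [0.5]}
B = {"a": [0.6, 0.4, 0.1], "c": [0.9]}

lhs = cardinality(algebraic_product(A, B))   # |A T_a B|
rhs = l2_norm(A) * l2_norm(B)                # ||A|| ||B||
print(lhs, rhs, lhs <= rhs)
```

Truncating at the shorter sequence is equivalent to zero padding here, because any product with a zero membership vanishes.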
2.3 Images of Fuzzy Multisets

Let us consider two images

$f\langle\langle A \rangle\rangle = \bigoplus_{x \in A} \{f(x)\}$ (4)

and

$f(A) = \bigcup_{x \in A} \{f(x)\}$ (5)

where $A$ is a fuzzy multiset of $X$. The author has studied (4), which uses the
addition $\oplus$ instead of the union in (5) (cf. [15, 16]). The image (4) is not an ordinary
one, since $f\langle\langle A \rangle\rangle$ is generally a multiset even when $A$ is an ordinary set. It
has been noted that such a multiset image arises in linear data handling in
information processing [15, 16]. In contrast, the image by (5) is compatible with
the ordinary image by the extension principle.

Let us recall that a fuzzy set is represented by

$A = \{(x_i, \mu_i)\}_{i=1,\ldots,q}.$

We then have

$f\langle\langle A \rangle\rangle = \{(f(x_i), \mu_i)\}_{i=1,\ldots,q}.$ (6)



Concerning (5), we use the membership sequence (1) for a fuzzy multiset $A$.
We hence obtain

$\mu^j_{f(A)}(y) = \max_{x \in f^{-1}(y)} \mu^j_A(x).$ (7)

(If $f^{-1}(y) = \emptyset$, then $\mu^j_{f(A)}(y) = 0$.)

It is immediate that if $A$ is an ordinary fuzzy set, the above relation
reduces to the extension principle.
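The difference between the two images can be seen on a tiny example. This is a minimal sketch under the same illustrative dict representation; `image_additive` stands for the image (4) and `image_union` for the image (5)/(7), and both names are hypothetical.

```python
from itertools import zip_longest

def image_additive(f, a):
    """Image (4): send every (x, mu) pair through f and pool the results."""
    out = {}
    for x, seq in a.items():
        out.setdefault(f(x), []).extend(seq)
    return {y: sorted(seq, reverse=True) for y, seq in out.items()}

def image_union(f, a):
    """Image (5)/(7): the j-th membership of y is the max of the j-th memberships over f^{-1}(y)."""
    out = {}
    for x, seq in a.items():
        y = f(x)
        out[y] = [max(m, n) for m, n in zip_longest(out.get(y, []), seq, fillvalue=0.0)]
    return out

# An ordinary fuzzy set: every element occurs once.
A = {"x1": [0.7], "x2": [0.4], "x3": [0.9]}
f = {"x1": "u", "x2": "u", "x3": "v"}.get   # x1 and x2 share the image u

print(image_additive(f, A))  # u occurs twice: a genuine multiset
print(image_union(f, A))     # u occurs once with the maximum membership
```

The additive image only appends values, so it needs no comparison between images of different elements, whereas the union image has to merge entries that share the same $y$.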



2.4 $\nu$-cut

In contrast to the $\alpha$-cut of a fuzzy set, another operation, the $\nu$-cut, is defined
for multisets.

Let $\nu$ be a given natural number. A $\nu$-cut for a crisp multiset $M$, denoted by
$(M)^\nu$, is defined as follows.

$(M)^\nu = \{x \in X : \mathrm{Count}_M(x) \ge \nu\}.$

Proposition 5. Let $M$, $N$ be crisp multisets of $X$. Take an arbitrary $\nu \in \{0, 1, 2, \ldots\}$. We then have

$(M \cup N)^\nu = (M)^\nu \cup (N)^\nu,$

$(M \cap N)^\nu = (M)^\nu \cap (N)^\nu.$

Notice also that, for the addition, $(M \oplus N)^\nu \ne (M)^\nu \oplus (N)^\nu$ in general.
We see that the $\nu$-cut corresponds to the $\alpha$-cut for ordinary fuzzy sets. It
thus is naturally defined. We proceed to generalize the $\nu$-cut to fuzzy multisets.

Let $A$ be a fuzzy multiset of $X$ and $\nu$ be a given natural number. The $\nu$-cut
of $A$, denoted by $(A)^\nu$, is a fuzzy set whose membership is given by

$\mu_{(A)^\nu}(x) = \mu^\nu_A(x), \quad x \in X.$ (8)

In other words, the membership of $(A)^\nu$ is the $\nu$-th value in the membership
sequence of $A$.

We have the following propositions, which show the validity of the definition.

Proposition 6. Let $A$ be an arbitrary fuzzy multiset of $X$. Then,

$([A]_\alpha)^\nu = [(A)^\nu]_\alpha$

holds. Namely, an $\alpha$-cut and a $\nu$-cut are commutative.

Proof.

$x \in ([A]_\alpha)^\nu \iff \mathrm{Count}_{[A]_\alpha}(x) \ge \nu \iff \mu^\nu_A(x) \ge \alpha \iff \mu_{(A)^\nu}(x) \ge \alpha \iff x \in [(A)^\nu]_\alpha.$

□

Proposition 7. Let $A$ and $B$ be arbitrary fuzzy multisets of $X$. Take any $\nu \in \{0, 1, 2, \ldots\}$. We then have

$(A \cup B)^\nu = (A)^\nu \cup (B)^\nu,$

$(A \cap B)^\nu = (A)^\nu \cap (B)^\nu.$

Note also that for $\oplus$, $(A \oplus B)^\nu \ne (A)^\nu \oplus (B)^\nu$ in general.
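The commutativity of the two cuts, and the failure of the $\nu$-cut to commute with the addition, can be checked on small data. This is a sketch under the illustrative dict representation, restricted to $\nu \ge 1$; function names are hypothetical.

```python
def alpha_cut(a, alpha):
    """[A]_alpha as a crisp multiset: element -> count of memberships >= alpha."""
    counts = {x: sum(1 for m in seq if m >= alpha) for x, seq in a.items()}
    return {x: c for x, c in counts.items() if c > 0}

def nu_cut_crisp(counts, nu):
    """(M)^nu for a crisp multiset given as element -> count."""
    return {x for x, c in counts.items() if c >= nu}

def nu_cut(a, nu):
    """(A)^nu: fuzzy set whose membership is the nu-th value of each sequence (nu >= 1)."""
    return {x: seq[nu - 1] for x, seq in a.items() if len(seq) >= nu}

def alpha_cut_set(fs, alpha):
    """alpha-cut of an ordinary fuzzy set given as element -> membership."""
    return {x for x, m in fs.items() if m >= alpha}

A = {"a": [0.8, 0.6, 0.3], "b": [0.5, 0.2], "c": [0.9]}

# Proposition 6 on a grid of alpha and nu values.
commutes = all(
    nu_cut_crisp(alpha_cut(A, alpha), nu) == alpha_cut_set(nu_cut(A, nu), alpha)
    for alpha in (0.1, 0.25, 0.5, 0.7, 1.0)
    for nu in (1, 2, 3, 4)
)

# Addition: M = N = {a} as crisp multisets, nu = 2.
M, N = {"a": 1}, {"a": 1}
added = {"a": M["a"] + N["a"]}                  # M (+) N contains a twice
lhs = nu_cut_crisp(added, 2)                    # (M (+) N)^2
rhs = nu_cut_crisp(M, 2) | nu_cut_crisp(N, 2)   # both cuts are empty, so their sum is empty
print(commutes, lhs, rhs)
```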

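2.5 Cuts and Images

While the $\alpha$-cut is commutative with each of the union, intersection, and
addition, the $\nu$-cut does not commute with the addition. This implies that for a fuzzy
multiset $A$,

$(f\langle\langle A \rangle\rangle)^\nu \ne f\langle\langle (A)^\nu \rangle\rangle$

in general. In contrast, it is easily seen that

$(f(A))^\nu = f((A)^\nu).$

We thus have the following.

Proposition 8. The image $f(\cdot)$ defined by (5) or by (7) commutes with the
$\alpha$-cut, the $\nu$-cut, and the union. In contrast, $f\langle\langle \cdot \rangle\rangle$ by (4) or (6) commutes with
the $\alpha$-cut and the addition. Namely,

$f([A]_\alpha) = [f(A)]_\alpha,$
$f((A)^\nu) = (f(A))^\nu,$

$f(A \cup B) = f(A) \cup f(B),$
$f\langle\langle [A]_\alpha \rangle\rangle = [f\langle\langle A \rangle\rangle]_\alpha,$

$f\langle\langle A \oplus B \rangle\rangle = f\langle\langle A \rangle\rangle \oplus f\langle\langle B \rangle\rangle.$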
3 Application to Rough Sets, Information Retrieval,
and Automatic Classification

We consider three applications of fuzzy multisets to rough sets, information 
retrieval, and automatic classification problems. 



3.1 Rough Fuzzy Multisets 

Dubois and Prade [3] have generalized rough sets [21] and defined rough fuzzy
sets. Assume that a classification $X/R$ of $X$ is given. The upper approximation
of a fuzzy set $A$, written $R^*[A]$, and the lower approximation, written
$R_*[A]$, are defined by the following:

$\mu_{R^*[A]}(Y) = \max_{x \in Y} \mu_A(x),$

$\mu_{R_*[A]}(Y) = \min_{x \in Y} \mu_A(x),$

where $Y \in X/R$.

Notice that the above rough approximations are commutative with the $\alpha$-cut.
Namely, the following are valid.

$R^*[[A]_\alpha] = [R^*[A]]_\alpha, \quad R_*[[A]_\alpha] = [R_*[A]]_\alpha.$

The upper approximation can be described in terms of the extension
principle. Namely, let $g$ be the natural mapping of $X$ onto $X/R$. That is, for an
arbitrary $x \in X$, there exists $Y \in X/R$ such that $x \in Y$, whereby we define
$g(x) = Y$. Then $R^*[A]$ satisfies

$\mu_{R^*[A]}(Y) = \mu_{g(A)}(Y).$

The above argument implies that the upper approximation of a fuzzy multiset
is defined by using the same mapping $g(A)$.

Assume that $A$ is a fuzzy multiset of $X$. The upper approximation of $A$,
denoted by $R^*[A]$, is defined by $g(A)$, where $g(\cdot)$ is the natural mapping of $X$ onto
$X/R$. Thus,

$\mu^j_{R^*[A]}(Y) = \mu^j_{g(A)}(Y) = \max_{x \in Y} \mu^j_A(x), \quad j = 1, 2, \ldots$

In contrast, the lower approximation cannot be defined using an image. 

Assume that $A$ is a fuzzy multiset of $X$. The lower approximation of $A$,
denoted by $R_*[A]$, is defined by

$\mu^j_{R_*[A]}(Y) = \min_{x \in Y} \mu^j_A(x), \quad j = 1, 2, \ldots$
Proposition 9. Let $A$ be an arbitrary fuzzy multiset of $X$. The following
equations then hold.

$[R^*[A]]_\alpha = R^*[[A]_\alpha],$
$(R^*[A])^\nu = R^*[(A)^\nu],$

$[R_*[A]]_\alpha = R_*[[A]_\alpha],$

$(R_*[A])^\nu = R_*[(A)^\nu].$

Proof. It is easy to see that the first two equations are valid, observing properties
of the image. For the third equation, we note

$\mathrm{Count}_{R_*[[A]_\alpha]}(Y) = n \iff \bigwedge_{x \in Y} \mathrm{Count}_{[A]_\alpha}(x) = n$

$\iff \forall x \in Y,\ \mu^j_A(x) \ge \alpha,\ j = 1, \ldots, n; \quad \exists z \in Y,\ \mu^k_A(z) < \alpha,\ k = n + 1, \ldots$

$\iff \bigwedge_{x \in Y} \mu^j_A(x) \ge \alpha,\ j = 1, \ldots, n; \quad \bigwedge_{x \in Y} \mu^k_A(x) < \alpha,\ k = n + 1, \ldots$

$\iff \mathrm{Count}_{[R_*[A]]_\alpha}(Y) = n.$

$(R_*[A])^\nu = R_*[(A)^\nu]$ is shown in a similar manner. We omit the details. □
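The rough approximations of a fuzzy multiset and their commutation with the $\alpha$-cut can be illustrated as follows. This is a sketch only: it assumes the illustrative dict representation, takes the lower approximation to be the position-wise minimum over each class (with missing memberships counted as 0), and uses hypothetical names throughout.

```python
from itertools import zip_longest

def _combine(a, block, op):
    """Apply op position-wise to the membership sequences of the elements of one class."""
    seqs = [a.get(x, []) for x in block]
    out = [op(vals) for vals in zip_longest(*seqs, fillvalue=0.0)]
    return [m for m in out if m > 0]

def upper_approx(a, classes):
    """R^*[A]: j-th membership of class Y is the max over x in Y."""
    return {name: _combine(a, block, max) for name, block in classes.items()}

def lower_approx(a, classes):
    """R_*[A]: j-th membership of class Y is the min over x in Y (assumed definition)."""
    return {name: _combine(a, block, min) for name, block in classes.items()}

def count_at(seq, alpha):
    """Count of an element in the alpha-cut, given its membership sequence."""
    return sum(1 for m in seq if m >= alpha)

classes = {"Y1": ["a", "b"], "Y2": ["c"]}          # the classification X/R
A = {"a": [0.8, 0.3], "b": [0.5], "c": [0.9, 0.2]}

upper = upper_approx(A, classes)
lower = lower_approx(A, classes)

# Commutation with the alpha-cut: cut-then-approximate equals approximate-then-cut.
ok = all(
    count_at(lower[name], alpha) == min(count_at(A.get(x, []), alpha) for x in block)
    and count_at(upper[name], alpha) == max(count_at(A.get(x, []), alpha) for x in block)
    for name, block in classes.items()
    for alpha in (0.2, 0.4, 0.6, 0.85)
)
print(upper, lower, ok)
```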

3.2 Fuzzy Database and Information Retrieval

Fuzzy database systems have been considered by many researchers (e.g., [22]).
Let us see how fuzzy multisets arise from a simple operation on a fuzzy database.
Although there are different frameworks for fuzzy databases, the simplest form
is a set of tuples $t = (a_1, a_2, \ldots, a_m)$ with the degree of relevance $\mu$. Let
us consider a simple example of a fuzzy database $FD$ of two tuples: $FD =
\{(a_1, a_2, \mu), (a'_1, a_2, \mu')\}$ where $a_1 \ne a'_1$. Thus $FD$ is not a fuzzy multiset but
an ordinary fuzzy set. A simple SELECT operation on the second column
(attribute) leads to $\{(a_2, \mu), (a_2, \mu')\}$, which is a fuzzy multiset. Another
interpretation is the application of $f((a_1, a_2)) = (a_2)$. Then,

$f\langle\langle \{(a_1, a_2, \mu), (a'_1, a_2, \mu')\} \rangle\rangle = \{(a_2, \mu), (a_2, \mu')\}.$

Generally, the SELECT operation of an ordinary database uses such $f\langle\langle \cdot \rangle\rangle$,
and hence its extension to fuzzy databases produces fuzzy multisets. The reason
why $f\langle\langle \cdot \rangle\rangle$ is used instead of the ordinary $f(\cdot)$ is that the former is a simple
sequential operation in which no check of whether two elements are equal is
necessary. Thus $f\langle\langle \cdot \rangle\rangle$ is far more efficient than $f(\cdot)$. Since a database system
should have union and intersection operations, those for fuzzy multisets should
be implemented in fuzzy database systems.

Research on information retrieval systems on the web is now very active. For
developing advanced search capabilities, consideration of an appropriate model
of information retrieval is absolutely necessary.
Items of information on the web arise redundantly and with degrees of
relevance. The same information content may appear many times in a search; some
occurrences are judged to be more relevant and others less. Such a state of
information is best captured using fuzzy multisets with multiple occurrences and
memberships for relevance.

We briefly mention some issues in information retrieval based on fuzzy
multisets here; further discussion of them, except for the classification problems
studied in the next section, is omitted to save space.

First, the retrieval operation has the form $f\langle\langle \cdot \rangle\rangle$ of sequential
processing of information on the web. Note again the above remark that $f\langle\langle \cdot \rangle\rangle$
is more natural and efficient than the mathematical $f(\cdot)$.

An advanced retrieval function should use dictionaries whereby associative
retrieval can be carried out [12, 13]. Such retrieval should employ the fuzzy
multirelations defined above, and then it is straightforward to extend methods of
associative retrieval to the case of fuzzy multisets.

Another method is retrieval of similar documents, in which a measure of
similarity between two documents or elements of information should be defined.
We can use the metrics discussed in the next section for this purpose.



3.3 Classification and Clustering 

Although most studies of automatic classification in engineering concentrate
on pattern recognition [4], classification problems have also frequently been
discussed in relation to information retrieval.

There are two major categories of automatic classification problems:
supervised classification and unsupervised classification. As a major part of
classification problems is supervised, the former is sometimes simply called classification,
while the latter, unsupervised one is called clustering.



Supervised Classification. For supervised classification, strong theories such
as the Bayesian technique [4] are available. However, most techniques are based
on the assumption of a Euclidean space; since fuzzy multisets have a weaker
structure, such parametric techniques are unusable. On the other hand,
nonparametric methods such as the nearest neighbor (NN) and K-nearest neighbor
(KNN) methods are applicable if a metric $D(A, B)$ is defined between an arbitrary
pair of fuzzy multisets (cf. [4], Chapter 4). For simplicity, suppose fuzzy
multisets $B_1, \ldots, B_m$ are given and they are classified into two classes $C$ and $C'$.
Moreover suppose a new fuzzy multiset $A$ should be classified into one of these
two classes.

NN. Find the nearest element to $A$: $\hat{B} = \arg\min_{1 \le i \le m} D(A, B_i)$; if $\hat{B} \in C$,
classify $A$ into $C$; if $\hat{B} \in C'$, classify $A$ into $C'$.

KNN. Find the $K$ nearest elements to $A$, that is, those $B_i$ that have the $K$ smallest
distances from $A$. Classify $A$ into the class which has the majority among
the $K$ nearest elements.
For the purpose of applying these techniques, we require a metric space of fuzzy
multisets.

There are different types of metric spaces for fuzzy multisets. We consider
two different metrics $D_1$ and $D_2$:

$D_1(A, B) = |A \cup B| - |A \cap B|,$ (9)

$D_2(A, B) = \sqrt{|A\,T_a\,A| + |B\,T_a\,B| - 2|A\,T_a\,B|}.$ (10)

It is easy to see that these metrics correspond to the $L_1$ and $L_2$ metrics by observing

$D_1(A, B) = \sum_{x \in X} \sum_{j} |\mu^j_A(x) - \mu^j_B(x)|,$

$D_2(A, B)^2 = \sum_{x \in X} \sum_{j} |\mu^j_A(x) - \mu^j_B(x)|^2.$

We hence have $|A| = D_1(A, \emptyset)$ and $\|A\| = D_2(A, \emptyset)$.

The nearest neighbor and K-nearest neighbor methods of classification based
on these metrics are now straightforward, as seen above.
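The NN and KNN rules with the metrics $D_1$ and $D_2$ can be sketched in a few lines. The code below assumes the illustrative dict representation of fuzzy multisets; the training data, class labels and function names are made up for the example.

```python
import math
from collections import Counter
from itertools import zip_longest

def _diffs(a, b):
    """Yield the j-th membership differences over all elements, zero padded."""
    for x in set(a) | set(b):
        for m, n in zip_longest(a.get(x, []), b.get(x, []), fillvalue=0.0):
            yield m - n

def d1(a, b):
    """L1-type metric: equals |A u B| - |A n B|."""
    return sum(abs(d) for d in _diffs(a, b))

def d2(a, b):
    """L2-type metric: equals sqrt(|A T A| + |B T B| - 2 |A T B|)."""
    return math.sqrt(sum(d * d for d in _diffs(a, b)))

def knn_classify(a, labelled, k=1, dist=d1):
    """Majority label among the k labelled multisets nearest to a (k=1 gives NN)."""
    nearest = sorted(labelled, key=lambda item: dist(a, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training multisets B_1..B_4 with their classes.
labelled = [
    ({"a": [0.9, 0.5]}, "C"),
    ({"a": [0.8], "b": [0.2]}, "C"),
    ({"c": [0.7, 0.6]}, "C'"),
    ({"c": [0.9], "b": [0.4]}, "C'"),
]
A = {"a": [0.7, 0.4]}

print(knn_classify(A, labelled, k=1))
print(knn_classify(A, labelled, k=3, dist=d2))
```

With an odd $K$ and two classes no tie can occur; for other settings a tie-breaking rule would have to be chosen.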

Clustering. Document clustering has also been studied [26, 12] and recently 
there are search engines employing clusters [6]. 

Fuzzy clustering of documents using Di and D 2 is studied in [19] where it 
is shown that the major difference is in calculating cluster centers in fuzzy c- 
means [1]. 

Further studies on fuzzy clustering include the use of the kernel trick in
support vector machines [27], whereby nonlinearities of cluster boundaries are
handled. For details, see [20].

Note. The space of fuzzy multisets is essentially infinite dimensional, even when
the underlying space $X$ is finite and we are interested in finite multisets, i.e.,
those having finite occurrences in $X \times [0, 1]$. The infinite dimensionality arises
because the limit of a sequence of finite fuzzy multisets may be an infinite fuzzy
multiset. This fact can be ignored in most applications of fuzzy multisets, but such
fuzzy multisets may be of interest in future theoretical studies. When handling
infinite fuzzy multisets, such inequalities as (3) are important.

4 Conclusion 

We have overviewed fuzzy multisets and considered two types of applications:
one is theoretical, in which rough sets are discussed; the other includes
information retrieval and automatic classification of objects.

Recent methods of information retrieval include rough set-based retrieval [17], 
where the use of fuzzy multisets should further be studied. 

As noted in the introduction, areas of multiset applications are becoming
broader. Moreover we are encountering fuzzy multisets unconsciously. For
example, populations in genetic algorithms can be regarded as fuzzy multisets,
although current theory may not be very useful to genetic algorithms yet.
However, new operations may be added to fuzzy multisets and, on the other hand,
genetic algorithms can include some features of fuzzy multisets. Future studies
encompassing these fields seem to be promising.
Abstract. An AR model, a classical feedforward neural network and an
artificial fuzzy neural network based on B-spline membership functions
are presented and considered. Some preliminary results and further
experiments that we performed are presented.
1 Introduction

Most models for the time series of stock prices have centered on autoregressive
(AR) processes. Traditionally, fundamental Box-Jenkins analysis [2] has been
the mainstream methodology used to develop time series models. This paper
compares the forecasts from an autoregressive (AR) model of stock prices with those
from neural network specifications. Our motivation for this comparison lies in the recent
increasing interest in the use of neural networks for forecasting
economic variables. In [6] the stock price autoregressive (AR) models based on the
Box-Jenkins methodology [2] were described. Although an AR model can reflect
reality well, these models are not suitable for situations where the quantities
are not functionally related. In economics, finance and so on, there are however
many situations where we must deal with uncertainties in a manner like humans;
in such cases one may incorporate the concept of fuzzy sets into the statistical models.

The primary objective of this paper is a focused introduction to the
autoregressive model and its application to analysis and forecasting. In
Sections 3 and 4, we present neural network approaches to modelling the time
series readings. The application of ANNs to stock price forecasting is based
on the assertion that data in stock price time series are chaotic, and that the
relationship between inputs and outputs is non-linear. A potent testing procedure is
needed at first. The detection of non-linear hidden patterns in this kind of time
series provides important information about their behaviour and improves the
forecasting ability over short time periods. For some tests and more profound
theoretical background, the reader should refer to [3], [9]. In Section 5, we
give some empirical results.
Fig. 1. The data for VAHOSTAV stock prices (January 1997 - August 1997) and the
values of the AR(7) model for VAHOSTAV stock prices estimated by GL algorithm



2 AR Modelling 

We give an example that illustrates one kind of possible result. We will regard
these results as the reference values for the NN and fuzzy NN
modelling approaches.

To illustrate the Box-Jenkins methodology, consider the stock price time
series of a typical company (say the VAHOSTAV company). We would like to
develop a time series model for this process so that a predictor for the process
output can be developed. The data were collected for the period January 2, 1997
to December 31, 1997, which provided a total of 163 observations (see Fig. 1).
To build a forecast model, the sample period for analysis $y_1, \ldots, y_{128}$ was
defined, i.e. the period over which the forecasting model was developed, and the
ex post forecast period (validation data set) $y_{129}, \ldots, y_{163}$ as the time period
from the first observation after the end of the sample period to the most recent
observation. By using only the actual and forecast values within the ex post
forecasting period, the accuracy of the model can be calculated.

After some experimentation, we have identified two models for this series
(see [2]): the first one (1) based on Box-Jenkins methodology and the second
one (2) based on signal processing.



$y_t = \xi + a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t, \quad t = 1, 2, \ldots, N-2$ (1)

$y_t = -\sum_{k=1}^{7} a_k y_{t-k} + \varepsilon_t, \quad t = 1, 2, \ldots, N-7.$ (2)



The final estimates of the parameters of models (1), (2) are obtained using OLS
(Ordinary Least Squares) and two adaptive filtering algorithms from signal
processing [1]. The Gradient Lattice (GL) adaptive algorithm and the Least Squares Lattice
Table 1. OLS, GL and LSL estimates of AR models

Model  Order  Est. proc.  a1       a2       a3       a4       a5       a6       a7       RMSE*
(1)    2      OLS          1.113   -0.127   (ξ = 26.639)                                  67.758
(2)    7      GL          -0.7513  -0.1701  -0.0230  -0.0128  -0.0028  -0.0472   0.0084   68.540
(3)    7      LSL         -0.8941  -0.6672  -0.7346  -0.2383   0.1805  -0.5692   0.4470   94.570

* ex post forecast period



(LSL) algorithm, which yield the parameter estimates of the predictors (1), (2),
were used. Tab. 1 gives the parameter estimates for model (2) and the corresponding
RMSEs. Fig. 1 shows the GL prediction results and the actual values
of the stock price time series in both the analysis and the ex post forecast period.
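To make the evaluation procedure concrete, the following sketch computes one-step AR forecasts of the form (1) and the RMSE over an ex post period. It is only an illustration: the coefficient values are taken as read from the first row of Table 1 (with 26.639 assumed to be the constant term), the price values are invented, and the function names are hypothetical.

```python
import math

def ar_forecast(history, coeffs, const=0.0):
    """One-step forecast: const + sum_k coeffs[k-1] * y_{t-k}, using the end of history."""
    return const + sum(a * history[-k] for k, a in enumerate(coeffs, start=1))

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))

def ex_post_rmse(series, n_sample, coeffs, const=0.0):
    """RMSE of one-step forecasts over the observations after the first n_sample ones."""
    preds = [ar_forecast(series[:t], coeffs, const) for t in range(n_sample, len(series))]
    return rmse(series[n_sample:], preds)

# AR(2) coefficients as read from Table 1 (assumption: 26.639 is the constant).
coeffs, const = [1.113, -0.127], 26.639

prices = [1900.0, 1910.0, 1905.0, 1920.0, 1935.0, 1930.0]   # made-up prices
print(ar_forecast(prices[:2], coeffs, const))
print(ex_post_rmse(prices, n_sample=4, coeffs=coeffs, const=const))
```

For a model written in the form (2), the same function can be used by passing the negated coefficients.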



3 Neural Network Approach 



The structure of an ANN is defined by its architecture, its activation function
and its learning algorithm. While many variations are possible, we suggest an
alternative to the most common form of NN, which was suggested and described
in [4]. This variant of NN is pictured in Fig. 2, which shows a fully connected
and strictly hierarchical NN with a variable number of inputs, a variable
number of hidden layer units and one output unit. Processing units of the
hidden layer have the S-shaped tanh activation function, which produces output
values $o_j$, $j = 1, 2, \ldots, s$, ranging from $-1$ to $1$. Processing units of the
hidden layer have the associated weights $w_{rj}$, $j = 1, 2, \ldots, k$. Input data $x_r$ of
the NN are standardised variables. The standardised version of the variables is
created in data preprocessing units, i.e., in the input layer. A preprocessing
unit subtracts the mean of the variable from each observation of the
variable and divides the result by the standard deviation of that variable. After
standardisation all input variables $x_r$ have values ranging from $-1$ to $1$ and the
bias equals zero. As mentioned in [5], many authors state that for standardised
input multilayer perceptron networks are more powerful than radial basis
networks, although a theoretically strong explanation is still lacking
or is not well understood. Hidden layer weights $w_{rj}$ are estimated from data
according to the learning technique and the choice of measure of accuracy in any
NN application. The processing units of the hidden layer produce output values
as



$o_j = \tanh\left(\sum_{r=1}^{k} x_r w_{rj}\right), \quad j = 1, 2, \ldots, s$ (3)
Fig. 2. Fully connected single hidden layer network



The dependent variable $y$ is produced in the output unit: the hidden layer
outputs $o_j$, $j = 1, 2, \ldots, s$, are each multiplied by an additional
parameter (weight) estimated from the data. A backpropagation algorithm is
used for estimating the weights.
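The forward pass of such a single-hidden-layer network can be sketched as follows. This is a minimal illustration, not the authors' software: the weights are arbitrary numbers, the output unit is assumed to form a plain weighted sum of the hidden outputs, and the population standard deviation is assumed in the standardisation step.

```python
import math

def standardise(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]

def hidden_outputs(x, w):
    """o_j = tanh(sum_r x[r] * w[r][j]) for every hidden unit j, as in Eq. (3)."""
    n_hidden = len(w[0])
    return [math.tanh(sum(x[r] * w[r][j] for r in range(len(x)))) for j in range(n_hidden)]

def network_output(x, w, v):
    """Output unit: weighted sum of the hidden outputs (assumed linear output)."""
    return sum(vj * oj for vj, oj in zip(v, hidden_outputs(x, w)))

w = [[0.5, 0.2],     # weights from input 0 to the two hidden units
     [0.5, -0.2]]    # weights from input 1 to the two hidden units
v = [1.0, 2.0]       # output weights (arbitrary illustrative values)

x = [1.0, -1.0]      # an already standardised input pattern
print(hidden_outputs(x, w))
print(network_output(x, w, v))
print(standardise([10.0, 12.0, 14.0]))
```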

4 B-spline Neural Network Approach 

The concept of fuzzy neural network (FNN) can be approached from several 
different avenues. The one that we have used for stock price forecasts is shown 
in Fig. 3. This figure shows the FNN with p x n input neurons (input layer), 
a single hidden layer with p processing units (fuzzy neurons) and one output 
unit. 

Input selection is of crucial importance to the successful development of FNN 
models. In models (1) and (2) potential inputs were chosen based on traditional 
statistical analysis: these included the raw stock price series and lags thereof. The
relevant lag structure of potential inputs was analysed using traditional
statistical tools: ACF, PACF and the MSE criterion. All the above techniques are
in reality imprecise (we developed parsimonious models, that is, models which
adequately describe the time series yet contain relatively few parameters, the
theoretical ACF was estimated by the sample ACF, etc.). In fact we obtain
a certain number of input values, but these values are only one of
many possible sets of values. Thus, we will further suppose that the potential
Fig. 3. The neuro fuzzy system architecture



inputs, which were chosen based on statistical analysis, are fuzzy numbers
characterized by membership functions (the uncertainty is modeled as a possibility
distribution) belonging to a class of bell-shaped functions.

Inputs to the fuzzy neuron in the hidden layer are fuzzy numbers denoted $B_{j,k,t}$,
$j = 1, 2, \ldots, p$, where $k$ identifies the order of the B-spline basis functions. They
express the neural input signals in terms of their membership functions based
on B-spline basis functions of the data. This concept is often called a B-spline
FNN [10].

Now, let us suppose that the system has $B'_t = [B_{1,k,t}, B_{2,k,t}, \ldots, B_{p,k,t}]$
as inputs and $y' = [y_1, y_2, \ldots]$ as outputs. Then the information set $\psi$ describing
process behaviour may be written in the form

(4)

Each $j$-th input neuron distributes the inputs to the $j$-th neuron in the
hidden layer. Neural input signals are then weighted by weights denoted $\omega_{j,t}$.
In general, the weights are in the range (0, 1). Each processing unit performs
internal operations on these neural inputs and computes the neural output signal
$a_j$. The internal operations are based on aggregation, i.e., the sum of the products
of weights and inputs, and its transformation into the neural output $a_j$. These
two internal operations for the $j$-th neuron in the hidden layer are defined as

(5)

for aggregation, where $U_j$ is a measure of similarity between the inputs and weights,
and

$a_j = f(U_j)$ (6)
for the transformation operation, where $f$ is the type of transfer function. We set
this function to the identity, that is, $f(U_j) = U_j$.

The neuron in the output layer simply performs the computation of
Eqs. (1), (2) and produces the output signal $y_t$.

The learning algorithm is based on the error signal. The neural network modifies
the weights $\omega_{j,t}$ in synaptic connections with respect to the desired fuzzy system
output $y_t$. The error of the fuzzy system, i.e., the difference between the fuzzy
system forecast $\hat{y}_t$ and the actual value $y_t$, is analysed through the RMSE. Let
$y_t$ be a linear function.

The measure of similarity (5) may be defined as the inner product of the vectors
$\mathbf{B}_{j,k}(y_{t-j})$ and $\boldsymbol{\omega}_{t-j}(j)$, that is

$U_j = \mathbf{B}_{j,k}(y_{t-j})\,\boldsymbol{\omega}_{t-j}(j)^{T}$ (7)

where $\boldsymbol{\omega}_{t-j}(j) = [\omega_{t-j}(j), \omega_{t-j+1}(j), \ldots, \omega_{t-j+n}(j)]$ is a
$1 \times n$ row vector of the weights and $\mathbf{B}_{j,k}(y_{t-j}) =
[B_{j,k}(y_{t-j}), B_{j,k}(y_{t-j+1}), \ldots, B_{j,k}(y_{t-j+n})]$ is a $1 \times n$ row
vector of the B-spline functions.

Next we show that the B-spline neural network may be considered as a fuzzy
linear controller. We now define the vectors as follows.

Let $z_j$ be a $1 \times n$ row vector of the regressor variables

$z_j = [y_{t-j}, y_{t-j+1}, \ldots, y_{t-j+n}], \quad j = 1, 2, \ldots, p,$

$y^T$ be a $1 \times n$ row vector of the observations

$y^T = [y_1, y_2, \ldots, y_n]$

and $a^T$ be a $1 \times p$ row vector of the parameters

$a^T = [a_1, a_2, \ldots, a_p].$

Then the B-spline FNN may also be considered as the well-known
Sugeno and Takagi [8] linear fuzzy controller, which (in our notation) has the
following form

$R$ = if $U_1 = a_1$ and $U_2 = a_2$ and ... and $U_p = a_p$

then $y_t = a^T z_j$, $t = 1, 2, \ldots, n$ (8)

where the fuzzy linear control rules $R$ have been derived by the neural network purely
from the database describing the previous or next behaviour of the system.

5 Empirical Results 

The network described in Section 3 was trained in software at the Faculty of
Management Science and Informatics, Zilina. The statistical forecast accuracy of
the FNN according to Fig. 3 depends on the type of transfer function in Eq. (6)
and the formulation of the B-spline curve in Eq. (7). The higher the value of $k$,
the better the approximation. All B-spline basis functions are cubic ones. Assume that
the mesh points of the B-spline basis function are $x_i$, $i = 1, 2, 3, 4$. Then the cubic
B-spline basis functions ($k = 3$) for $i = 1$ have the following form



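$B_{i,3}(y_{t-j}) = \dfrac{(y_{t-j} - x_i)}{x_{i+3} - x_i} \cdot \dfrac{(y_{t-j} - x_i)^2}{(x_{i+2} - x_i)(x_{i+1} - x_i)}$

for $y_{t-j} \in [x_i, x_{i+1})$

$B_{i,3}(y_{t-j}) = \dfrac{(y_{t-j} - x_i)}{x_{i+3} - x_i} \left[ \dfrac{(y_{t-j} - x_i)(x_{i+2} - y_{t-j})}{(x_{i+2} - x_i)(x_{i+2} - x_{i+1})} + \dfrac{(y_{t-j} - x_{i+1})(x_{i+3} - y_{t-j})}{(x_{i+3} - x_{i+1})(x_{i+2} - x_{i+1})} \right] + \dfrac{(x_{i+4} - y_{t-j})}{(x_{i+4} - x_{i+1})} \cdot \dfrac{(y_{t-j} - x_{i+1})^2}{(x_{i+3} - x_{i+1})(x_{i+2} - x_{i+1})}$

for $y_{t-j} \in [x_{i+1}, x_{i+2})$

$B_{i,3}(y_{t-j}) = \dfrac{(y_{t-j} - x_i)}{x_{i+3} - x_i} \cdot \dfrac{(x_{i+3} - y_{t-j})^2}{(x_{i+3} - x_{i+1})(x_{i+3} - x_{i+2})} + \dfrac{(x_{i+4} - y_{t-j})}{x_{i+4} - x_{i+1}} \left[ \dfrac{(y_{t-j} - x_{i+1})(x_{i+3} - y_{t-j})}{(x_{i+3} - x_{i+1})(x_{i+3} - x_{i+2})} + \dfrac{(x_{i+4} - y_{t-j})(y_{t-j} - x_{i+2})}{(x_{i+4} - x_{i+2})(x_{i+3} - x_{i+2})} \right]$

for $y_{t-j} \in [x_{i+2}, x_{i+3})$

$B_{i,3}(y_{t-j}) = \dfrac{(x_{i+4} - y_{t-j})}{x_{i+4} - x_{i+1}} \cdot \dfrac{(x_{i+4} - y_{t-j})^2}{(x_{i+4} - x_{i+2})(x_{i+4} - x_{i+3})}$

for $y_{t-j} \in [x_{i+3}, x_{i+4})$

$B_{i,3}(y_{t-j}) = 0$ otherwise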

where $y_{t-j}$, $j = 1, 2, \ldots, p$, $t \in A$ are observations. These mesh points are given as
$x_1 = \min_{t \in A}\{y_{t-j}\}$, $x_3 = \max_{t \in A}\{y_{t-j}\}$.
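The piecewise expressions above are what the standard Cox-de Boor recursion produces for degree 3, so they can be cross-checked numerically. The sketch below is an illustration with zero-based knot indices and a made-up uniform mesh; it assumes strictly increasing knots (so no denominator vanishes) and is not the authors' implementation.

```python
def bspline_basis(i, k, knots, y):
    """Cox-de Boor recursion for the degree-k basis function B_{i,k}(y)."""
    if k == 0:
        return 1.0 if knots[i] <= y < knots[i + 1] else 0.0
    left = (y - knots[i]) / (knots[i + k] - knots[i]) * bspline_basis(i, k - 1, knots, y)
    right = ((knots[i + k + 1] - y) / (knots[i + k + 1] - knots[i + 1])
             * bspline_basis(i + 1, k - 1, knots, y))
    return left + right

def first_piece(y, x):
    """Closed form on [x_i, x_{i+1}) with x = [x_i, ..., x_{i+4}]."""
    return (y - x[0]) ** 3 / ((x[3] - x[0]) * (x[2] - x[0]) * (x[1] - x[0]))

def last_piece(y, x):
    """Closed form on [x_{i+3}, x_{i+4})."""
    return (x[4] - y) ** 3 / ((x[4] - x[1]) * (x[4] - x[2]) * (x[4] - x[3]))

knots = [0.0, 1.0, 2.0, 3.0, 4.0]   # illustrative uniform mesh x_i, ..., x_{i+4}

for y in (0.5, 1.5, 2.0, 2.5, 3.5, 4.5):
    print(y, bspline_basis(0, 3, knots, y))
```

A cubic basis function needs five consecutive mesh points, which is why the closed forms refer to $x_i, \ldots, x_{i+4}$.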




Our FNN was trained on the training data set. Periodically, during the training
period, the RMSE of the FNN was measured not only on the training set but
also on the validation set. The final FNN chosen for the stock price prediction is
the one with the lowest error on the validation set. Note also, the training phase
was finished after 5 • 10^ epochs, the best model being obtained after 2.3 • 10^ 
epochs. 

The RMSEs of our predictor models are shown in Tab. 2. The table shows that
the basic (non fuzzy) artificial neural network barely improves on the AR(2)
model (RMSE 67.2 versus 67.7), which does not support its use for daily
frequencies. The initial results of the FNN forecasting model (RMSE 63.5) are
clearly better.
Table 2.

Model                              RMSE*
AR(2)                              67.7
Basic (non fuzzy) neural network   67.2
FNN                                63.5

* Validation set
Abstract. This paper presents a short evaluation of the integration of
information derived from wavelet non-linear-time-invariant (non-LTI) 
projection properties using Support Vector Machines (SVM). These properties 
may give additional information for a classifier trying to detect known patterns 
hidden by noise. In the experiments we present a simple electromagnetic pulsed 
signal recognition scheme, where some improvement is achieved with respect 
to previous work. SVMs are used as a tool for information integration, 
exploiting some unique properties not easily found in neural networks. 



1 Introduction 

In previous work we have introduced a new algorithm to detect the presence (or
absence) of electromagnetic signals using optimum theoretic discriminators [1] and
Support Vector Machines [2] applied to wavelet transform output. This approach 
performs 15 dB better than previous algorithms using wavelets ([3] and [4]). The 
main advantage of our algorithm is its ability to integrate huge amounts of unrelated 
information, i.e., information coming from different sources (see figure 1). 

However, this algorithm needed a time search process to be sure that in the case a 
signal is emitted, our system will process all its energy in at least one window. In [5] 
we introduced a valid time search algorithm and gave some hints about its ability to 
process greater amounts of information at low computational costs. 

In this paper we focus on taking advantage of the time search process itself to
improve the probability of detection ($P_d$) and the probability of false alarm ($P_{fa}$). The
wavelet transformation shows how time variant properties can be exploited using
Machine Learning (ML) tools. The Support Vector Machine is an ML tool with
remarkable properties [6], especially useful in this case because of its ability to easily
impose a greater error penalty on one of the classes only.
Fig. 1. Information processing scheme. Given a digitized signal, multiple transformations can
be executed. Then an SVM will gather those features and classify them either as a pulsed signal
or noise alone. Using an SVM, the threshold is defined as 0 (the threshold could also be
identified as parameter b in the SVM definition)



2 Tools 

2.1 Wavelet Properties of Interest 

Almost all existing signals can be described using a wavelet transform. Wavelets are
generated by the scaling and translation of a single prototype function called the mother
wavelet [7]. The result of this transform is a set of vectors (called scales) whose
coefficients describe the behaviour of the input data with respect to time and 
frequency. The discrete wavelet transform offers high time resolution for low scales 
(high frequencies) and high frequency resolution for high scales (low frequencies). 

One of the main attributes of wavelet transforms is that they are non-LTI, i.e., they
are time variant. Given a system where for an input $x(n)$ the output is $y(n)$, it is said
to be time invariant if, for a shifted input $x(n - n_0)$, the output of the system is $y(n - n_0)$,
independently of the chosen time shift $n_0$.

Suppose we define a known pulsed signal $N$ samples long (the sampling rate
is usually defined by the digitizing hardware resources available). Let $d_k$ be the
wavelet scale used to analyse the data, $D$ an integer number, $D < N$, and $H$ the size of
the processing window, $H > N$ (the unit is always one sample). Suppose we apply the
simple wavelet transform to a window having a pulsed signal centered on it, and noise
on both sides. Now let us shift the data in the window so as to have the pulsed signal $D$
samples away from the center, and let us calculate the wavelet transformation again
(see Fig. 2). Of course, the output coefficients of the two windows will be different, but as
wavelets are non-LTI, those sets of coefficients will not have a direct relationship
between them. Furthermore, even though they share the same source (the pulsed
signal placed somewhere inside the window), they do not bear the same information.
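The lack of time invariance can be observed with a very small experiment. The sketch below uses one level of the decimated Haar transform purely as an illustrative stand-in (the text does not say which wavelet is used) and a noise-free toy window; all names are hypothetical.

```python
import math

def haar_level1(signal):
    """One level of the decimated Haar transform: (approximation, detail) coefficients."""
    s = 1.0 / math.sqrt(2.0)
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):   # non-overlapping sample pairs
        approx.append(s * (a + b))
        detail.append(s * (a - b))
    return approx, detail

def shift(signal, d):
    """Circularly delay the signal by d samples."""
    d %= len(signal)
    return signal[-d:] + signal[:-d] if d else list(signal)

window = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # a short pulse inside the window

_, detail0 = haar_level1(window)             # pulse aligned with the sample pairs
_, detail1 = haar_level1(shift(window, 1))   # same pulse, one sample later
_, detail2 = haar_level1(shift(window, 2))   # same pulse, two samples later

print(detail0)
print(detail1)
print(detail2)
```

With a one-sample shift the detail coefficients are no longer a shifted copy of the original ones, which is the time-variant behaviour described above; a two-sample shift happens to realign the pulse with the decimation grid.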

The main difference between processed windows is that they hold different
projections of one and the same reality. Each projection is incomplete: it loses
information. On the other hand, because of time variance, nearby windows have some
degree of complementary information, which can be used to improve the overall
picture of the input data.

Fig. 2. Input data (somewhat exaggerated for visibility purposes). For both input data the x-axis
is time and the y-axis is received power. The size of the processing window is H, the size of the
pulsed signal is N and the difference (i.e. time) between both snapshots is D. Both raw signals
inside window H have the same amount of basic class information (the N-size pulsed signal is
complete inside both), but after wavelet processing the projections will be complementary

For instance, Table 1 shows a cross-correlation coefficients matrix relating
different projections of the same pulse obtained with different shifts (values of D), as will be
defined in the experiments section.

Table 1. Cross-correlation coefficients matrix for 10000 pulse observations and three variables
(0-shift, 11-shift and 23-shift)

            0-shift   11-shift  23-shift
0-shift     1.0000    0.9123    0.9573
11-shift    0.9123    1.0000    0.8977
23-shift    0.9573    0.8977    1.0000

Note that although the three variables are clearly not independent, they are not
completely correlated; that is, each of the three variables holds some small portion of
individual, unshared information. We can apply here the concept of entropy, described
by Shannon in [8]. The entropy H(x) of a random variable x is a measure of the
uncertainty held by that variable, i.e. the information it contains. In our case, we can
analyse the increase of the entropy in the system when we add a new variable y that is not
independent of x (for instance, the 0-shift and the 11-shift variables). This relation is

H(x,y)<H(x) + H(y). (1) 

The conditional entropy of y after knowing the value of x is also defined as 

Hx(y) = H(x,y) - H(x) . (2) 

Therefore, using (1) and (2) we get to 

H(x) + H(y) > H(x,y) = H(x) + Hx(y) . (3) 

In other words, as the two variables are not independent, the information they contain
jointly is less than the sum of the information each contains separately. The extra chunk
of information, Hx(y), is related to the degree of independence between x and y,
which can be observed in the cross-correlation matrices in Tables 1 and 2.
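
Relations (1) to (3) can be checked numerically on a small example. The joint distribution below is purely illustrative (two correlated binary variables) and is not taken from the pulse data; the helper names are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


def marginal(joint, axis):
    """Marginal distribution of one variable of a joint distribution."""
    totals = {}
    for outcome, p in joint.items():
        totals[outcome[axis]] = totals.get(outcome[axis], 0.0) + p
    return totals


# Hypothetical joint distribution p(x, y): x and y agree 80% of the time.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
h_xy = entropy(joint.values())
h_y_given_x = h_xy - h_x  # conditional entropy Hx(y), equation (2)

print(f"H(x) = {h_x:.4f}, H(y) = {h_y:.4f}, H(x,y) = {h_xy:.4f}")
print(f"Hx(y) = {h_y_given_x:.4f}")
```

Here H(x,y) is strictly smaller than H(x) + H(y), and Hx(y) is the positive amount of information that y adds once x is known.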

Moreover, the more incomplete-pulse information sources we use, the better the results
we will obtain. As more variables are introduced in the model, new uncorrelated
information will be harder to find, but nevertheless there will always be some
improvement. Of course, there must be a limit to this process. Suppose x is the set of
M possible projections {x_1, x_2, ..., x_M}; then its entropy can be written as

    H(x) = H(x_1) + \sum_{j=2}^{M} H_{\{x_a,\, a<j\}}(x_j).    (4)



Then, the limit to this process is

    \lim_{M \to \infty} H(x).    (5)

Nevertheless, this convergence analysis is not in the scope of this paper. It will be
covered in future research by the authors.

Given two variables, if they are completely correlated, then they share all the
information, i.e., one of them is useless. On the other hand, if two variables defined as
two different representations of one single event are not completely correlated, then it
is highly probable that an ML process will be able to extract different unshared
information from both of them, improving classification rates.

Using the cross-correlation coefficients matrix of the input data we can make
rough estimates of how uncorrelated each variable is with respect to the other
sources of information. Table 2 shows another matrix, similar to Table 1, to which
we have added two more variables, defined as 37-shift and 53-shift. Note that
the two new variables are highly correlated (their correlation coefficient is 0.9951, very close
to 1), so one of them would be useless in the classification function. Nevertheless,
these two projections can still carry some information not present in the other
three representations of the same event. They will be useful, but only one of them is needed.
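
This screening can be sketched in a few lines. The code below is an illustration, not the authors' procedure: `pearson`, `correlation_matrix` and `redundant_pairs` are hypothetical helpers, and the redundancy threshold of 0.99 is an assumption chosen only for the example. The matrix is the one reported in Table 2.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Correlation matrix of a list of variables (one sequence each)."""
    return [[pearson(a, b) for b in columns] for a in columns]


def redundant_pairs(matrix, threshold=0.99):
    """Index pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if abs(matrix[i][j]) > threshold:
                pairs.append((i, j))
    return pairs


# Values of Table 2, in the order 0-, 11-, 23-, 37- and 53-shift.
table2 = [
    [1.0000, 0.9123, 0.9573, 0.9209, 0.9201],
    [0.9123, 1.0000, 0.8977, 0.9131, 0.9171],
    [0.9573, 0.8977, 1.0000, 0.9474, 0.9449],
    [0.9209, 0.9131, 0.9474, 1.0000, 0.9951],
    [0.9201, 0.9171, 0.9449, 0.9951, 1.0000],
]

print(redundant_pairs(table2))
```

With this threshold only the pair formed by the 37-shift and 53-shift variables (indices 3 and 4) is flagged, matching the observation in the text.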

Table 2. Cross-correlation coefficients matrix of 10000 pulse observations and five variables
(0-shift, 11-shift, 23-shift, 37-shift and 53-shift)

            0-shift   11-shift  23-shift  37-shift  53-shift
0-shift     1.0000    0.9123    0.9573    0.9209    0.9201
11-shift    0.9123    1.0000    0.8977    0.9131    0.9171
23-shift    0.9573    0.8977    1.0000    0.9474    0.9449
37-shift    0.9209    0.9131    0.9474    1.0000    0.9951
53-shift    0.9201    0.9171    0.9449    0.9951    1.0000



2.2 SVMs 

Support Vector Machine (SVM) is a Machine Learning tool introduced by V. Vapnik 
in 1995, arising from Structural Risk Minimization theory and VC dimension [9]. 
Since then, it has been used in a variety of problems with excellent results. 

The simplest definition of the SVM is related to the classification task. An SVM
separates two classes using an optimum discriminator hyperplane so as to maximize a
convex quadratic objective function,

    \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j),    (6)

where x_i are the training patterns, y_i are their classes and α_i are the pattern weight
coefficients. These coefficients are found during training and are used in the test
phase, along with the independent term b. After training, each new test pattern z can be
classified using

    f(z) = \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot z) + b.    (7)

Note that all x_i are training data and only those with α_i > 0 affect the
definition of the separating hyperplane. These data are called support vectors.

The SVM algorithm can also be slightly modified to accept non-linear separators in
input space. For that purpose, wherever a dot product appears in the formulas,
it can be replaced by a kernel operation satisfying

    K(x, y) = \Phi(x) \cdot \Phi(y),    (8)

where Φ is a transformation function to a higher-dimensional space [6].
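
As a minimal sketch of the decision function and of the kernel substitution, the code below evaluates it for a toy set of support vectors. The support vectors, coefficients and bias are made-up values for illustration, and the function names are hypothetical.

```python
def dot(u, v):
    """Plain dot product, i.e. the linear kernel."""
    return sum(a * b for a, b in zip(u, v))


def poly2(u, v):
    """Inhomogeneous polynomial kernel of degree two."""
    return (dot(u, v) + 1.0) ** 2


def decision(z, support_vectors, labels, alphas, bias, kernel=dot):
    """Soft output: sum_i alpha_i * y_i * K(x_i, z) + b."""
    total = 0.0
    for x, y, alpha in zip(support_vectors, labels, alphas):
        total += alpha * y * kernel(x, z)
    return total + bias


# Hypothetical two-point problem: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
bias = 0.0

z = (2.0, 0.0)
linear_output = decision(z, support_vectors, labels, alphas, bias)
poly_output = decision(z, support_vectors, labels, alphas, bias, kernel=poly2)
print(linear_output, poly_output)
```

The same function serves the linear and the non-linear case: only the `kernel` argument changes, which is the substitution described by (8). The sign of the soft output gives the predicted class.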

This algorithm has remarkable properties: there is one solution only (no local
minima); SVM parameters are few and easy to handle; data separation can be
performed in a very high dimensional feature space, making it very powerful; new
features are not calculated explicitly, so there is no complexity increase regardless of
the use of very high dimensional spaces; expected noise figures are easily introduced
in the training algorithm, improving robustness; generalization capability is
outstanding, despite the high dimensional space.

The training process needs two opposite-class sets. In our experiments we required
an asymmetric separation surface, i.e., false positives and false negatives have
different importance (false positives, Pfa , were expected to be around 10'^ and false 
negatives, 1- Pa , were expected around 10"'). Even though this is not an uncommon 
case in Machine Learning, algorithms such as neural networks need either a modified 
learning strategy or a modification of the training set to comply with such asymmetry. 
Using the first approach requires knowledge of optimization theory and convergence 
analysis; the second approach requires greater computational resources, as one of the 
sets is increased at a rate 100 to 1 in this example. On the other hand, the SVM 
algorithm allows every individual training data point to have a different error weight 
through the soft-margin parameter C. Our implementation has a twofold C value (C+ 
and C- version), allowing us to impose easily (and fast) greater error penalty on false 
positives. 
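
The effect of a twofold C can be illustrated without a full training loop by evaluating the soft-margin objective for two candidate hyperplanes. This is a sketch under assumptions: the data are made up, `c_pos` and `c_neg` are hypothetical names for the penalties on errors made on positive-class and negative-class samples respectively (a false positive is an error on a negative-class sample), and no optimizer is implemented.

```python
def weighted_objective(w, b, samples, labels, c_pos, c_neg):
    """0.5*|w|^2 plus class-weighted hinge losses (soft-margin primal cost)."""
    hinge_pos = 0.0
    hinge_neg = 0.0
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        loss = max(0.0, 1.0 - margin)
        if y > 0:
            hinge_pos += loss
        else:
            hinge_neg += loss
    norm = 0.5 * sum(wi * wi for wi in w)
    return norm + c_pos * hinge_pos + c_neg * hinge_neg


# One-dimensional toy data with overlapping classes.
samples = [(-2.0,), (0.0,), (1.0,), (0.5,), (2.0,), (3.0,)]
labels = [-1, -1, -1, 1, 1, 1]

loose = ((1.0,), -0.75)  # threshold at x = 0.75, lets one negative through
strict = ((1.0,), -2.0)  # threshold at x = 2.0, no false positives

for name, (c_pos, c_neg) in {"symmetric": (1.0, 1.0), "asymmetric": (1.0, 10.0)}.items():
    cost_loose = weighted_objective(*loose, samples, labels, c_pos, c_neg)
    cost_strict = weighted_objective(*strict, samples, labels, c_pos, c_neg)
    print(name, cost_loose, cost_strict)
```

With equal penalties the looser threshold has the lower cost; once errors on the negative class are penalized ten times more, the stricter threshold becomes preferable, at the price of more missed detections.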

3 Integration Algorithm

3.1 Time-Shift Scheme 

The time search algorithm described in [5] places some constraints on the parameters
N, D and H introduced in the previous section. As the wavelet calculation is the most
expensive step of the algorithm, it is wise to reduce the size of the window H
to the minimum. Therefore we set H = N (preferably N = 2^n). To guarantee that we will
always obtain the complete N-sample pulse in at least one N-sample
window we cannot skip any sample, so the shift distance between two consecutive
windows should be set to 1. Nevertheless, depending on the scale used and the pulse
form itself, the shift distance between consecutive windows could be slightly increased
so as to minimize computational cost (see figure 3).



Fig. 3. Tailoring of the basic time search algorithm described in figure 2 (windows shown:
Window k through Window k+Dj). As H=N, only one window will have the complete pulse
inside. All other windows will have some shift Dj that adds exactly that amount of
noise-alone samples at the input

Note that, unlike the general algorithm described in [5], which could have more
than one consecutive H-sample window with the N-sample pulse inside, these
parameter constraints define one complete-pulse projection and many incomplete-pulse
projections. Even though the incomplete-pulse windows should bear less information
as D gets higher, they still contribute some classification improvement, as will
be observed in the experiments section.

After defining these concepts, we have to choose which shift distances with respect
to the basic, centered-pulse window (D values) will be used for information
integration in the last phase of the algorithm. This choice depends on several factors:
pulse frequency, pulse size, sampling rate, and the wavelet scales used in the first steps.
Usually, the projections of consecutive windows (one sample shift apart) are very
much alike. Two similar projections are of no use together: they bear no more
information as a whole than separately. In our choice of projections, we need a
balance between non-similarity (not too close) and usefulness (containing a big chunk of
the source data, and therefore not too far away). The smaller the shift with respect to the
complete-pulse window the better, but not so small as to produce redundant projections.
Also, as wavelets are built using powers of two, it seems unwise to use even shifts.

With this set of empirical rules in mind, in our experiments we chose as
shifts D prime numbers starting around one percent of the window size away from the
complete-pulse window.
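
One possible reading of these rules is sketched below. It is not the authors' selection procedure: the function `candidate_shifts`, the starting fraction of one percent of N and, in particular, the minimum gap of 12 samples between consecutive shifts are assumptions introduced for illustration.

```python
import math


def is_prime(n):
    """Trial-division primality test, adequate for small shift values."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def candidate_shifts(n_samples, min_fraction=0.01, min_gap=12, count=4):
    """Prime shifts starting near min_fraction * n_samples, at least min_gap apart."""
    shifts = []
    d = math.ceil(n_samples * min_fraction)
    while len(shifts) < count and d < n_samples:
        if is_prime(d):
            shifts.append(d)
            d += min_gap  # skip shifts too close to the one just chosen
        else:
            d += 1
    return shifts


print(candidate_shifts(1024))
```

For N = 1024 and these assumed parameters the sketch returns 11, 23, 37 and 53, the shift values used in the experiments section; other gap values would give different sets.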



3.2 SVM Integration 

To integrate the information coming from different sources (see figure 4) we need a
tool able to evaluate statistically how a separating decision function can
be supported by more than one variable or feature. SVMs are well suited to
finding the best separating surface in such a scenario. Using non-linear kernels we
are able to generate non-linear relationships between input features that may adapt
better to the statistics behind the sources of information, and so use all the potential
given by the entropy measure.

The output of each linear SVM is a random variable (the SVM soft output, before
the threshold 0 is applied) whose mean depends on the presence or absence of the
n-shift scheme for which that SVM has been trained.

In the example shown in figure 4, we have three different sources, and therefore
three linear SVMs trained to detect different schemes, for instance those shown in
figure 3. Every time window defines an H-sample set that is analyzed through the
Wavelet+LinearSVM scheme, giving a real number as the output. The
integration tool is executed once for each window as follows: first, set the current window
as W0 and extract the LinearSVM-D0 output as the first integration input; second, count D1
windows to the left, set that window as W1 and extract the LinearSVM-D1 output as the second
integration input; third, again from W0, count D2 windows to the left, set that window as W2
and extract the LinearSVM-D2 output as the third integration input; fourth, execute the
integration tool on the previously extracted inputs, obtaining a hard output (either there
was a pulse in window W0 or not).
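
The gathering of integration inputs can be sketched as follows. The data layout is an assumption made for the example: `soft_outputs[d][w]` is taken to hold the soft output, for window index w, of the linear SVM trained for shift d, and "to the left" is interpreted as a smaller window index. The function name `integration_inputs` is hypothetical.

```python
def integration_inputs(soft_outputs, shifts, k):
    """Collect one soft output per shift for the current window k.

    For each shift d, the value is read d windows to the left of window k.
    """
    if k < max(shifts):
        raise ValueError("not enough earlier windows for the largest shift")
    return [soft_outputs[d][k - d] for d in shifts]


# Toy soft outputs for three detectors (shifts 0, 1 and 2) over four windows.
soft_outputs = {
    0: [0.0, 0.1, 0.2, 0.3],
    1: [1.0, 1.1, 1.2, 1.3],
    2: [2.0, 2.1, 2.2, 2.3],
}

features = integration_inputs(soft_outputs, [0, 1, 2], k=3)
print(features)
```

The resulting feature vector (one real number per linear SVM) is what the integration SVM receives; its thresholded output is the hard decision for the current window.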



Fig. 4. Integration procedure for information coming from different projections (each
window W+Di is passed through the DWT to obtain its {dk} scales)

To integrate all this information, we used a non-linear inhomogeneous polynomial
kernel of degree two. Higher-degree polynomial kernels (up to the number of input
features) performed worse, as did the linear kernel to a lesser extent. Those non-linear
features gave the decision function enough flexibility to attain an optimum.
Nevertheless, the choice of kernel in an SVM is not of much importance for small
and not-too-difficult problems such as the one described in this paper. A brief
analysis of how to select a kernel in SVMs is given in [6].
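
For two input features, the degree-two inhomogeneous polynomial kernel can be checked against its explicit feature map. The sketch below assumes the common form K(x, y) = (x . y + 1)^2; the exact scaling used in the experiments is not stated in the text, and the function names are hypothetical.

```python
import math


def poly2_kernel(x, y):
    """Inhomogeneous polynomial kernel of degree two: (x . y + 1)^2."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** 2


def poly2_features(x):
    """Explicit feature map of the same kernel for two-dimensional inputs."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r * x1 * x2, r * x1, r * x2, 1.0)


x = (0.5, -1.0)
y = (2.0, 0.25)

implicit = poly2_kernel(x, y)
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
print(implicit, explicit)
```

Both values coincide, which shows how the kernel supplies squared and cross-product terms of the soft outputs without computing them explicitly.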

But the main advantage of using the SVM was its capability to assign a different
penalty cost to the errors of each class, as mentioned in previous sections. We used
the twofold C parameter (C+ and C-), which made it possible to fulfil the very
restrictive requirements on Pfa. Neural networks do not possess this capability
in their basic training algorithm, and therefore could not be used as the
integration tool.

3.3 Algorithm Complexity 

Let us set the multiply-add as the basic operation. Let us define H as the initial vector
size, W as the wavelet filter size and K as the wavelet scale; then the complexity of
the wavelet calculation is O(KWH). Let us also define S as the size of the coefficient
vector at scale K (S = H/2^K) and M as the number of linear SVM classifiers trained for
different complete / incomplete pulse schemes; then the execution of the linear classifiers is
O(MS). The non-linear integration SVM uses very few input features (M), and
therefore it can be easily described with a small set of support vectors (not much
greater than M), using the reduced set approach by Burges [10].

Adding up these intermediate processes, the computational requirements are
O(KWH) + O(MS) + O(M^2). The algorithm complexity therefore remains bounded by
the computational cost of the wavelet transform, O(KWH). Thus, for a small additional
cost (the new integration step) we obtain much better results, as can be observed
throughout the experiments.
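
A quick order-of-magnitude check of these three terms is sketched below. The values H = 1024, K = 4 and M = 5 follow the experimental setup; the filter size W = 10 is an assumption (a ten-tap filter), and constant factors hidden by the O() notation are ignored.

```python
def operation_counts(h, w, k, m):
    """Rough multiply-add counts for the three stages of the algorithm."""
    s = h // 2 ** k  # size of the coefficient vector at scale K
    return {
        "wavelet": k * w * h,   # O(KWH)
        "linear_svms": m * s,   # O(MS)
        "integration": m * m,   # O(M^2)
    }


counts = operation_counts(h=1024, w=10, k=4, m=5)
print(counts)
```

Under these assumptions the wavelet stage accounts for 40960 operations against 320 and 25 for the other two, so the integration step adds well under one percent to the total.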



4 Experiments 

Our experiments had the following setup: chirp pulse (see [1]) of 1024 samples
(N); Daubechies 5 mother wavelet, using the d4 wavelet scale coefficients; white
Gaussian noise with zero mean and unit standard deviation; five linear detectors (SVMs)
trained to detect the complete pulse (named 0-shift), 11 noise samples
plus 1013 pulse samples (named 11-shift), and similarly for 23 samples
(named 23-shift), 37 samples (named 37-shift) and 53 samples (named 53-shift), each
having Pfa = 10“^, as established in [2]. 

Thus, for each window observation, we computed 5 similar Wavelet + Decision-function
schemes. We generated two integration SVMs, one having as inputs the 0-,
11- and 23-shift soft outputs, and the other having the 0-, 11-, 23-, 37- and 53-shift
soft outputs, all of them extracted from the corresponding windows. Figures are
expressed as the mean probability of detection over a fixed SNR interval with
respect to the desired probability of false alarm.

Fig. 5. Performance functions for individual wavelet+LinearSVM schemes for complete- and
incomplete-pulse windows. The X axis corresponds to -log10(Pfa). The Y axis corresponds to
the mean of the Pd results for SNR values between 0 and -15 dB




Fig. 6. Performance functions for multiple wavelet+LinearSVM plus integration SVM schemes
for two sets of individual projections: [0,11,23] and [0,11,23,37,53]. The individual
schemes of figure 5 are also shown for comparison. The X axis corresponds to -log10(Pfa). The Y
axis corresponds to the mean of the Pd results for SNR values between 0 and -15 dB

Finally, in figure 6 we can see the effects of multiple projection integration. The Pfa
improvement is a bit less than one order of magnitude, increasing as the Pfa requirements
become harder. Note also how the two added variables (37-shift and 53-shift), which
were observed to be highly correlated with each other and to carry considerably less class
information, are able to improve our results substantially.

5 Conclusions 

In this paper we have seen a new, open-ended line of information processing: the use of
additional Wavelet + Decision-function schemes applied to previously determined
incomplete pulsed signals. This new set of features provides the final model with
uncorrelated information, improving the classification rates.

Non-LTI properties of wavelet transforms define different useful projections of
reality, which yield unshared information. In our experiments we have analysed the
case in which multiple incomplete-pulse schemes are executed, but this algorithm can
also be used with multiple complete-pulse schemes (it only needs H > N, as defined in
Section 2), obtaining better overall results. In our experiments we wanted to
emphasize the fact that even when processing less energy (a shifted window), the
resulting projection may carry useful information additional to that of the
complete-pulse window projection.

Support Vector Machines are a great tool for information integration, with linear as
well as non-linear kernel functions. In this application we have confirmed the ease of
use of SVMs as a Machine Learning tool regarding their capability to provide different
weights for the two classes, allowing the training system to comply with the otherwise
difficult probability of false alarm rates (several orders of magnitude lower than
the probability of detection).

Further analysis is needed to determine how an increased number of processing 
units in one multiple-source decision function will upgrade the system capabilities. 
We will also analyse in more depth how the shift number (we used small prime 
numbers only) affects the pulse projection, and its entropy convergence. 

Abstract. In this paper, we examine the effect of weighting training
patterns on the performance of fuzzy rule-based classification systems.
A weight is assigned to each given pattern based on the class distribution
of its neighboring given patterns. The values of weights are determined
proportionally by the number of neighboring patterns from the same
class. Large values are assigned to given patterns with many patterns
from the same class. Patterns with small weights are not considered in
the generation of fuzzy rule-based classification systems. That is, fuzzy
if-then rules are generated from only patterns with large weights. These
procedures can be viewed as preprocessing in pattern classification. The
effect of weighting is examined for an artificial data set and several
real-world data sets.



1 Introduction 

Fuzzy rule-based systems have been applied mainly to control problems [1, 2, 3].
Recently fuzzy rule-based systems have also been applied to pattern classification
problems. There are many approaches to the automatic generation of fuzzy
if-then rules from numerical data for pattern classification problems. Genetic
algorithms have also been used for generating fuzzy if-then rules for pattern
classification [4, 5, 6].

In this paper, we examine the effect of weighting training patterns on the
performance of fuzzy rule-based classification systems. A weight is assigned to
each given pattern based on the class distribution of its neighboring given
patterns. The values of weights are determined proportionally by the number of
neighboring patterns from the same class. Large values are assigned to given
patterns with many patterns from the same class. Patterns with small weights
are not considered in the generation of fuzzy rule-based classification systems.
That is, fuzzy if-then rules are generated from only patterns with large weights.
These procedures can be viewed as preprocessing in pattern classification. The
effect of weighting is examined for an artificial data set and several real-world
data sets.

Fig. 1. An example of antecedent fuzzy sets (horizontal axis: attribute value; last
panel: (d) five fuzzy sets)



2 Fuzzy Rule-Based Classification System 

2.1 Pattern Classification Problems 

Various methods have been proposed for fuzzy classification [7]. Let us assume
that our pattern classification problem is an n-dimensional problem with C
classes. We also assume that we have m given training patterns x_p = (x_p1, x_p2,
..., x_pn), p = 1, 2, ..., m. Without loss of generality, each attribute of the given
training patterns is normalized into the unit interval [0, 1]. That is, the pattern
space is the n-dimensional unit hypercube [0, 1]^n in our pattern classification
problems.

In this study, we use fuzzy if-then rules of the following type in our fuzzy 
rule-based classification systems: 

    Rule R_j: If x_1 is A_j1 and x_2 is A_j2 and ... and x_n is A_jn
              then Class C_j with CF_j,  j = 1, 2, ..., N,    (1)

where R_j is the label of the j-th fuzzy if-then rule, A_j1, ..., A_jn are antecedent
fuzzy sets on the unit interval [0, 1], C_j is the consequent class (i.e., one of the
given C classes), CF_j is the grade of certainty of the fuzzy if-then rule R_j,
and N is the total number of fuzzy if-then rules. As antecedent fuzzy sets, we
use triangular fuzzy sets as in Fig. 1, where we show various partitions of the unit
interval into a number of fuzzy sets.

2.2 Generating Fuzzy If-Then Rules 

In our fuzzy rule-based classification systems, we specify the consequent class
and the grade of certainty of each fuzzy if-then rule from the given training
patterns [8, 9, 10]. In [10], it is shown that the use of the grade of certainty in fuzzy
if-then rules allows us to generate comprehensible fuzzy rule-based classification
systems with high classification performance.

The consequent class C_j and the grade of certainty CF_j of a fuzzy if-then rule
are determined in the following manner:

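Generation Procedure of Fuzzy If-Then Rule

1. Calculate \beta_{Class h}(R_j) for Class h (h = 1, ..., C) as

    \beta_{\mathrm{Class}\ h}(R_j) = \sum_{x_p \in \mathrm{Class}\ h} \mu_{j1}(x_{p1}) \cdot \ldots \cdot \mu_{jn}(x_{pn}),   h = 1, 2, \ldots, C,    (2)

   where \mu_{ji}(\cdot) is the membership function of the antecedent fuzzy set A_ji.

2. Find Class \hat{h} that has the maximum value of \beta_{Class h}(R_j):

    \beta_{\mathrm{Class}\ \hat{h}}(R_j) = \max\{\beta_{\mathrm{Class}\ 1}(R_j), \beta_{\mathrm{Class}\ 2}(R_j), \ldots, \beta_{\mathrm{Class}\ C}(R_j)\}.    (3)

   If two or more classes take the maximum value, the consequent class C_j of the
   rule R_j cannot be determined uniquely. In this case, specify C_j as C_j = φ.
   If a single class takes the maximum value, let C_j be Class \hat{h}.

3. If a single class takes the maximum value of \beta_{Class h}(R_j), the grade of
   certainty CF_j is determined as

    CF_j = \frac{\beta_{\mathrm{Class}\ \hat{h}}(R_j) - \bar{\beta}}{\sum_{h=1}^{C} \beta_{\mathrm{Class}\ h}(R_j)},    (4)

   where

    \bar{\beta} = \frac{\sum_{h \neq \hat{h}} \beta_{\mathrm{Class}\ h}(R_j)}{C - 1}.    (5)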

The number of fuzzy if-then rules in a fuzzy rule-based classification system
depends on how each attribute is partitioned into fuzzy subsets. For example,
when we divide each attribute into three fuzzy subsets in a ten-dimensional
pattern classification problem, the total number of fuzzy if-then rules is 3^10 =
59049. This is what is called the curse of dimensionality. The grade of certainty
CF_j can be adjusted by a learning algorithm [11].
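
The generation procedure can be sketched as follows. The code is a minimal illustration, not the authors' implementation: the triangular membership function assumes evenly spaced fuzzy sets with peaks at k/(K-1) on [0, 1] (the text only refers to Fig. 1 for their shape), classes are numbered from 0, and all function names are hypothetical.

```python
from itertools import product


def triangular(x, k, n_sets):
    """Membership of x in the k-th (0-based) of n_sets triangular sets on [0, 1]."""
    width = 1.0 / (n_sets - 1)
    return max(0.0, 1.0 - abs(x - k * width) / width)


def compatibility(pattern, antecedent, n_sets):
    """Product of the per-attribute membership grades."""
    grade = 1.0
    for x, k in zip(pattern, antecedent):
        grade *= triangular(x, k, n_sets)
    return grade


def generate_rule(antecedent, patterns, labels, n_classes, n_sets):
    """Return (consequent class or None, grade of certainty) for one antecedent."""
    beta = [0.0] * n_classes
    for x, c in zip(patterns, labels):
        beta[c] += compatibility(x, antecedent, n_sets)
    best = max(beta)
    total = sum(beta)
    if total == 0.0 or beta.count(best) > 1:
        return None, 0.0  # consequent cannot be determined uniquely
    beta_bar = (total - best) / (n_classes - 1)
    return beta.index(best), (best - beta_bar) / total


def generate_rule_base(patterns, labels, n_classes, n_sets):
    """Generate one rule per combination of antecedent fuzzy sets."""
    rules = {}
    for antecedent in product(range(n_sets), repeat=len(patterns[0])):
        consequent, cf = generate_rule(antecedent, patterns, labels, n_classes, n_sets)
        if consequent is not None:
            rules[antecedent] = (consequent, cf)
    return rules


patterns = [(0.0,), (0.2,), (1.0,)]
labels = [0, 0, 1]
rules = generate_rule_base(patterns, labels, n_classes=2, n_sets=2)
print(rules)
```

Because the loop enumerates every combination of antecedent fuzzy sets, its cost grows as K^n, which is the curse of dimensionality mentioned above.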

2.3 Fuzzy Reasoning 

By the rule generation procedure in 2.2, we can generate N fuzzy if-then rules 
in (1). After both the consequent class Cj and the grade of certainty CFj are 
determined for all the N fuzzy if-then rules, a new pattern x is classified by the 
following procedure [8]: 
Fig. 2. Two-dimensional pattern classification problem



Fuzzy reasoning procedure for classification

1. Calculate \alpha_{Class h}(x) for Class h, h = 1, 2, ..., C, as

    \alpha_{\mathrm{Class}\ h}(x) = \max\{\mu_j(x) \cdot CF_j \mid C_j = \mathrm{Class}\ h,\ j = 1, 2, \ldots, N\},   h = 1, 2, \ldots, C,    (6)

   where

    \mu_j(x) = \mu_{j1}(x_1) \cdot \ldots \cdot \mu_{jn}(x_n).    (7)

2. Find Class h* that has the maximum value of \alpha_{Class h}(x):

    \alpha_{\mathrm{Class}\ h^*}(x) = \max\{\alpha_{\mathrm{Class}\ 1}(x), \ldots, \alpha_{\mathrm{Class}\ C}(x)\}.    (8)

   If two or more classes take the maximum value, then the classification of x
   is rejected (i.e., x is left as an unclassifiable pattern); otherwise assign x to
   Class h*.
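
A minimal sketch of this reasoning procedure is given below. It assumes evenly spaced triangular fuzzy sets on [0, 1], a rule base stored as a dictionary from antecedent index tuples to (class, CF) pairs, and classes numbered from 0; these representation choices and the function names are hypothetical.

```python
def triangular(x, k, n_sets):
    """Membership of x in the k-th (0-based) of n_sets triangular sets on [0, 1]."""
    width = 1.0 / (n_sets - 1)
    return max(0.0, 1.0 - abs(x - k * width) / width)


def compatibility(pattern, antecedent, n_sets):
    """Product of the per-attribute membership grades, as in (7)."""
    grade = 1.0
    for x, k in zip(pattern, antecedent):
        grade *= triangular(x, k, n_sets)
    return grade


def classify(pattern, rules, n_classes, n_sets):
    """Single-winner fuzzy reasoning; returns None when the pattern is rejected."""
    alpha = [0.0] * n_classes
    for antecedent, (consequent, cf) in rules.items():
        grade = compatibility(pattern, antecedent, n_sets) * cf
        alpha[consequent] = max(alpha[consequent], grade)
    best = max(alpha)
    if alpha.count(best) > 1:
        return None  # two or more classes share the maximum value
    return alpha.index(best)


# Hypothetical one-dimensional rule base with two fuzzy sets.
rules = {(0,): (0, 1.0), (1,): (1, 1.0)}

print(classify((0.2,), rules, n_classes=2, n_sets=2))
print(classify((0.9,), rules, n_classes=2, n_sets=2))
print(classify((0.5,), rules, n_classes=2, n_sets=2))
```

The last call illustrates rejection: at x = 0.5 both rules are equally compatible and have the same grade of certainty, so no class wins.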

3 Assigning Weights 

The main aim of assigning weights is to extract only the patterns necessary for
improving the performance of fuzzy rule-based classification systems. Generalization
ability in particular is our main focus.

Let us consider the two-dimensional two-class pattern problem in Fig. 2, where all
the given patterns are shown. 250 patterns were generated from each of two
normal distributions: a mean (0,0) and a variance 0.3^2 for Class 1, and (1,1)
and 0.3^2 for Class 2. Neither distribution has any correlation between the two
attributes. We can see that the two classes overlap with each other.

In Fig. 3, we show classification boundaries that are generated by fuzzy rule-based
classification systems with two, three, four, and five fuzzy sets for each
attribute (see Fig. 1). From these figures, we can see that the classification
boundaries are not diagonal when the number of fuzzy sets for each attribute is large
(e.g., (d) in Fig. 3).

Fig. 3. Classification boundaries by fuzzy rule-based classification systems

In order to determine the weights of given training patterns, we count the
number of patterns from the same class in their neighborhood. Let us denote
the neighborhood size as N_size. We examine the N_size nearest patterns of each
given training pattern to determine the value of its weight. We use the
following equation to determine the weight w_p of the p-th given pattern:

    w_p = \frac{N_{same}}{N_{size}},    (9)

where N_same is the number of given patterns from the same class as the p-th
given pattern among its N_size nearest patterns. The weight w_p of the p-th given
pattern can be viewed as a measure of overlap. That is, if the value of w_p is large,
the p-th given pattern is surrounded by many patterns from the same class as the p-th
training pattern. On the other hand, the p-th given pattern is possibly an
outlier if the value of w_p is low. Only given patterns that have higher weights
than a prespecified threshold value are used as training patterns. In this paper,
we denote the threshold as θ.
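
The weighting and selection steps can be sketched as below. The sketch makes assumptions the text does not state: distances are Euclidean, a pattern is not counted as its own neighbor, and ties in distance are broken by pattern order. The function names are hypothetical.

```python
def assign_weights(patterns, labels, n_size):
    """Weight of each pattern: fraction of its n_size nearest patterns sharing its class."""
    weights = []
    for p, (x, c) in enumerate(zip(patterns, labels)):
        others = [q for q in range(len(patterns)) if q != p]
        # Sort the other patterns by squared Euclidean distance to pattern p.
        others.sort(key=lambda q: sum((a - b) ** 2 for a, b in zip(x, patterns[q])))
        n_same = sum(1 for q in others[:n_size] if labels[q] == c)
        weights.append(n_same / n_size)
    return weights


def select_patterns(weights, theta):
    """Indices of the patterns whose weight is higher than the threshold theta."""
    return [p for p, w in enumerate(weights) if w > theta]


# Toy one-dimensional data: the Class 1 pattern at 0.15 lies inside Class 0.
patterns = [(0.0,), (0.1,), (0.2,), (0.15,), (0.9,), (1.0,), (0.8,)]
labels = [0, 0, 0, 1, 1, 1, 1]

weights = assign_weights(patterns, labels, n_size=3)
kept = select_patterns(weights, theta=0.5)
print(weights)
print(kept)
```

The isolated Class 1 pattern receives weight 0 and is dropped, while every other pattern has two of its three nearest patterns in its own class and is kept.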

Fig. 4. Result of assigning weights, (N_size, θ) = (400, 0.5) (horizontal axis: x1)



In Fig. 4, we show the result of the weight assignment to the two-dimensional
patterns in Fig. 2. We specified the value of N_size as N_size = 400 and the value
of the threshold as θ = 0.5. From Fig. 4, we can see that the overlapping area is
smaller than that in Fig. 2. In Fig. 5, we also show the classification boundaries
in the same way as in Fig. 3 but using the patterns in Fig. 4. From Fig. 5,
we can see that the classification boundaries are more diagonal than those in
Fig. 3 when the number of fuzzy sets for each attribute is large. Simple shapes of
classification boundaries such as those in Fig. 5 can lead to high generalization
ability. We examine the performance of fuzzy rule-based classification systems
on unseen data in the next section.

4 Computer Simulations 

In this section, we examine the performance of the proposed weight assignment method.
First we explain the data sets used in our computer simulations to examine
the effect of assigning weights on the performance of fuzzy rule-based
classification systems. In this section, we only show classification results on unseen
data because our focus is only on whether the generalization ability can be
improved by the weight assignment in Section 3.



4.1 Performance Evaluation on a Two-Dimensional Data Set 

We use the two-dimensional two-class pattern classification problem in Section
3. The data set consists of 250 given training patterns from each class. We
generate test patterns for evaluating the performance of fuzzy rule-based
classification systems after the weight assignment to the training patterns. From each
class, we generate 500 patterns as test data by using the same class distribution
as described in Section 3. That is, first we assign a weight to each of the
training patterns by using a prespecified neighborhood size N_size. Then, we generate
fuzzy if-then rules from those training patterns that have a higher weight
than a prespecified threshold value θ. Test data are used to examine the
performance of the generated fuzzy rule-based classification system. Table 1 shows
the classification results of the fuzzy rule-based classification systems. We also show
the performance of the conventional method (i.e., no weight is considered and
all the given patterns are used to generate fuzzy if-then rules) in the last row of
the table. From this table, we can see that generalization ability is not improved
very well by the weight assignment.

Fig. 5. Classification boundaries, (N_size, θ) = (400, 0.5) (panel (d): five fuzzy sets;
horizontal axis: x1)

4.2 Performance Evaluation on Iris Data Set 

The iris data set is a four-dimensional three-class problem with 150 given training
patterns [12]. There are 50 training patterns from each class. This data set is
one of the most well-known pattern classification problems. Many researchers
have applied their classification methods to the iris data set. For example, Weiss
and Kulikowski [13] examined the performance of various classification methods
such as neural networks and the nearest neighbor classifier for this data set. Grabisch
and Dispot [14] have also examined the performance of various fuzzy classification
methods such as fuzzy integrals and fuzzy k-nearest neighbor for the iris data
set.

Table 1. Classification results for the two-dimensional data set

              The number of fuzzy sets
N_size        2        3        4        5
1             96.3%    96.4%    96.5%    96.6%
5             96.3%    96.3%    96.5%    96.4%
10            96.3%    96.3%    96.5%    96.5%
20            96.3%    96.3%    96.5%    96.4%
50            96.4%    96.4%    96.4%    96.4%
100           96.4%    96.4%    96.4%    96.5%
200           96.3%    96.4%    96.6%    96.4%
300           96.3%    96.4%    96.5%    96.4%
400           96.3%    96.3%    96.4%    96.4%
No Weight     96.3%    96.5%    96.5%    96.6%


Table 2. Classification results for the iris data set

              The number of fuzzy sets
N_size        2        3        4        5
1             68.7%    93.3%    90.0%    95.3%
5             68.7%    93.3%    89.3%    96.0%
10            67.3%    93.3%    88.7%    96.0%
20            68.0%    93.3%    89.3%    95.3%
50            69.3%    90.7%    90.7%    96.0%
No Weight     67.3%    93.3%    89.3%    95.3%


We examined the performance of fuzzy rule-based classification systems on
unseen data by using the leave-one-out method. In the leave-one-out method,
a single pattern is used as unseen data and the other patterns are used to
generate fuzzy if-then rules by the procedures in Section 2. Table 2 shows the
classification results for the iris data. From this table, we can see that the
generalization ability of fuzzy rule-based classification systems for the iris data
set is improved by using our weight assignment.
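
The leave-one-out loop itself is independent of the classifier. The sketch below takes the training and classification steps as callables; a nearest-neighbor classifier is used as a stand-in only to keep the example short (it is not the fuzzy rule-based system), and all names are hypothetical. A rejected pattern (a `None` prediction) counts as an error here, which is an assumption.

```python
def leave_one_out_accuracy(patterns, labels, train, classify):
    """Fraction of patterns classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(patterns)):
        train_x = patterns[:i] + patterns[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if classify(model, patterns[i]) == labels[i]:
            correct += 1
    return correct / len(patterns)


def train_nearest(patterns, labels):
    """Stand-in training step: simply memorize the labeled patterns."""
    return list(zip(patterns, labels))


def classify_nearest(model, x):
    """Stand-in classifier: label of the closest memorized pattern."""
    nearest = min(model, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return nearest[1]


patterns = [(0.0,), (0.1,), (0.9,), (1.0,)]
labels = [0, 0, 1, 1]
accuracy = leave_one_out_accuracy(patterns, labels, train_nearest, classify_nearest)
print(accuracy)
```

To evaluate the weighted approach, the `train` callable would assign weights, discard low-weight patterns and generate the fuzzy if-then rules from the remaining ones, all inside each fold.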



4.3 Performance Evaluation on Cancer Data 

The cancer data set is a nine-dimensional two-class pattern classification prob- 
lem. In Grabisch’s works [14, 15, 16], various fuzzy classification methods have 
been applied to cancer data set in order to compare each of those fuzzy classifica- 
tion methods. In the same manner as for the iris data set in the last subsection, 
we examined the performance of the proposed ensembling method for the cancer 
data set. The performance of our ensembling method on appendicitis data set 
is shown in Table 3. We can see from this table that by assigning weights to 
training patterns the performance of the fuzzy rule-based classification system 
is improved in its classification ability. However, the performance of the fuzzy 
Table 3. Classification results for the cancer data set

                 The number of fuzzy sets
N_size           2         3         4         5
1              70.3%     64.3%     61.2%     51.4%
5              75.2%     68.5%     66.1%     58.0%
10             75.2%     69.2%     67.8%     59.1%
20             73.8%     70.3%     67.5%     59.8%
50             72.7%     68.5%     66.4%     59.8%
100            69.6%     65.7%     64.0%     58.0%
200            69.6%     65.7%     64.0%     58.0%
No Weight      71.3%     67.8%     65.4%     57.7%



rule-based classification systems degrades if the neighborhood size N_size is
large (e.g., N_size = 200). This is because the number of Class 1 patterns is
85, and a neighborhood size larger than that value does not make sense.

5 Conclusions 

In this paper, we examined the effect of weight assignment on the performance
of fuzzy rule-based classification systems. A weight for each given pattern is
determined by the proportion of patterns from the same class in its
neighborhood. Only those given patterns that have weights larger than a
threshold value are considered in the generation of fuzzy rule-based
classification systems. Thus, the behavior of our weight assignment is
determined by two factors: the neighborhood size and the threshold value.
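The weight assignment summarized above can be illustrated with a short sketch. Several details are assumptions made only for this example: the neighborhood of a pattern is taken to be its N_size nearest other patterns in Euclidean distance, and the names `neighborhood_weights` and `select_patterns` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pattern = Tuple[Sequence[float], int]  # (attribute vector, class label)


def neighborhood_weights(patterns: List[Pattern], n_size: int) -> List[float]:
    """Weight of a pattern = share of same-class patterns among its n_size nearest neighbours.

    Assumes at least two patterns; the pattern itself is excluded from its neighbourhood.
    """
    weights = []
    for i, (x, label) in enumerate(patterns):
        others = [p for j, p in enumerate(patterns) if j != i]
        others.sort(key=lambda p: math.dist(p[0], x))
        neighbours = others[:n_size]
        same_class = sum(1 for _, c in neighbours if c == label)
        weights.append(same_class / len(neighbours))
    return weights


def select_patterns(patterns: List[Pattern], n_size: int, threshold: float) -> List[Pattern]:
    """Keep only the patterns whose weight is larger than the threshold."""
    weights = neighborhood_weights(patterns, n_size)
    return [p for p, w in zip(patterns, weights) if w > threshold]


toy = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.2, 0.1), 0),
       ((0.15, 0.05), 1),                      # class-1 pattern surrounded by class 0
       ((1.0, 1.0), 1), ((0.9, 1.0), 1), ((1.0, 0.9), 1)]
toy_weights = neighborhood_weights(toy, n_size=2)
toy_kept = select_patterns(toy, n_size=2, threshold=0.25)
```

In this toy set the isolated class-1 pattern receives weight 0 and is dropped before rule generation, while the remaining six patterns are kept.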

We evaluated the performance of fuzzy rule-based classification systems with
the weight assignment by comparing it with the performance without it. Through
the computer simulations, we showed that the generalization ability is improved
by increasing the neighborhood size. However, the performance of fuzzy
rule-based classification systems can be degraded when the neighborhood size
is unnecessarily large.

We only used the simplest version of fuzzy rule-based classification systems
in this paper. Our future work includes the use of more sophisticated versions
to show that generalization ability can be efficiently improved by using several
techniques in the fields of data mining, evolutionary computation, and so on.
Abstract. This paper deals with heuristic algorithm selection, which
can be stated as follows: given a set of solved instances of an NP-hard
problem, predict which algorithm will best solve a new instance. For
this problem, there are two main selection approaches. The first
consists of developing functions that relate performance to problem
size. In the second, more characteristics are incorporated; however,
they are defined neither formally nor systematically. In contrast, we
propose a methodology to model algorithm performance predictors that
incorporate critical characteristics. The relationship between
performance and characteristics is learned from historical data using
machine learning techniques. To validate our approach, we carried out
experiments using an extensive test set. In particular, for the
classical bin packing problem, we developed predictors that incorporate
the interrelation among five critical characteristics and the
performance of seven heuristic algorithms. We obtained an accuracy of
81% in the selection of the best algorithm.



1 Introduction 

NP-hard combinatorial optimization problems have been solved with many
different algorithms. In general, non-deterministic algorithms have been
proposed as a good alternative for very large instances, and deterministic
algorithms are considered adequate for small instances. However, no adequate
method is currently known for selecting the most appropriate algorithm to
solve a particular instance of this kind of problem.

The algorithm selection problem is far from easy to solve, for several
reasons. In particular, it is known that in real-life situations no algorithm
* This research was supported in part by CONACYT and COSNET. 
outperforms the others in all circumstances [1]. Nevertheless, theoretical
research has suggested that problem instances can be grouped in classes and
that for each class there exists an algorithm that solves the problems of that
class most efficiently [2]. Consequently, a few researchers have identified
algorithm dominance regions considering more than one problem characteristic.
However, they do not identify formally and systematically the characteristics
that affect performance in a critical way, and they do not incorporate them
explicitly in a performance model.

We have worked for several years on the problem of data-object distribution
on the Internet, which is a generalization of the bin packing problem. We have
designed solution algorithms and carried out a large number of experiments with
them. As expected, no algorithm showed absolute superiority. Hence, we have
also been working on developing an automatic method for algorithm selection.
In this paper we present a new methodology for the characterization of
algorithm performance and its application to algorithm selection.

The proposed methodology consists of three phases: initial training,
prediction and retraining. The first phase constitutes the kernel of the
selection process. In this phase, starting from a set of historical data solved
with several algorithms, machine learning techniques, in particular clustering
and classification, are applied to learn the relationship between performance
and problem characteristics. In the prediction phase, the learned relationship
is applied to select an algorithm for a given instance. The purpose of the
retraining phase is to improve the accuracy of the prediction with new
experience.

This paper is organized as follows. An overview of the main works on
algorithm selection is presented in Section 2. An application problem and its
solution algorithms are described in Section 3; in particular, we use the bin
packing (BP) problem and seven heuristic algorithms (deterministic and
non-deterministic). Then, Section 4 describes the proposed methodology to
characterize algorithm performance and select the algorithm with the best
expected performance; details of the application of our characterization
mechanism to the solution of BP instances are described there too.

2 Related Work 

A recent approach for modeling algorithm performance incorporates more than
one problem characteristic in order to increase the precision of the algorithm
selection process. The works described below follow this approach.

Borghetti developed a method to correlate each instance characteristic with
algorithm performance [3]. An important shortcoming of this method is that it
does not consider the combined effect of all the characteristics. Fink
developed a selection method based on the estimation of the algorithm gain: an
instance taxonomy is defined by the user, and for each class the performance
statistics are stored; for a new instance, its class is determined and the
algorithm that solves it best is predicted using the statistical data [4]. The
METAL team proposed a method to select the most appropriate algorithm for a
classification task: for a new problem the data characteristics are computed,
which are used to choose
Table 1. Related work on algorithm performance characterization

Research       Characteristics   Critical          Instances        Critical
               definition        characteristics   grouping by      characteristics
                                 modeling          similar          learned into
                                                   characteristics  performance model
Fink           ✓                 -                 ✓ (informal)     -
Borghetti      ✓                 ✓                 -                -
METAL          ✓                 ✓                 -                -
Our proposal   ✓                 ✓                 ✓ (formal)       ✓



from a case base the most similar classification problems; finally, the best
algorithm is determined using past experience from applying the algorithms to
similar problems [5].

Table 1 summarizes the works described above. Column 2 shows whether several
characteristics are defined. Column 3 shows whether the integrated
characteristics are critical for performance and whether they are defined in a
formal and systematic way. Column 4 indicates whether instances are grouped by
similar characteristics. Column 5 indicates whether the relationship between
performance and critical characteristics is learned, from past experiments,
into a performance model. Notice that none of the previous works includes all
the aspects required to characterize algorithm performance with the aim of
selecting the best algorithm for a given instance. In contrast, our method is
the only one that considers these four main aspects.

3 Application Problem 

The bin packing problem is used to exemplify our algorithm selection
methodology. This section gives a brief description of the one-dimensional bin
packing problem and its solution algorithms.

3.1 Problem Description 

The bin packing problem is an NP-hard combinatorial optimization problem
in which there is a given sequence of n items $L = (a_1, a_2, \ldots, a_n)$,
each one with a given size $0 < s(a_i) \le c$, and an unlimited number of bins,
each of capacity $c$. The question is to determine a minimal partition of $L$
into bins $B_1, B_2, \ldots, B_m$ such that in each bin $B_j$ the aggregate
size of all the items in $B_j$ does not exceed $c$. This constraint is
expressed in (1).

$$\sum_{a_i \in B_j} s(a_i) \le c \quad \forall j,\ 1 \le j \le m \qquad (1)$$

In this work, we consider the discrete version of the one-dimensional bin
packing problem, in which the bin capacity is an integer $c$, the number of
items is $n$, and each item size is $s_i$, which is chosen from the set
$\{1, 2, \ldots, c\}$.
3.2 Heuristic Solution Algorithms

An optimal solution can be found by considering all the ways to partition a
set of n items into n or fewer subsets. Unfortunately, the number of possible
partitions is larger than $(n/2)^{n/2}$ [6]. The heuristic algorithms presented
in this section use deterministic and non-deterministic strategies for
obtaining suboptimal solutions with less computational effort.

Deterministic algorithms always follow the same path to arrive at the
solution. For this reason, they obtain the same solution in different
executions. The deterministic approximation algorithms for bin packing are very
simple and run fast. A theoretical analysis of these algorithms is presented in
[7]. In this work, we used the following five algorithms: First Fit Decreasing
(FFD), Best Fit Decreasing (BFD), Match to First Fit (MFF), Match to Best Fit
(MBF), and Modified Best Fit Decreasing (MBFD).
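As an illustration of how simple these deterministic heuristics are, the following is a minimal sketch of the textbook First Fit Decreasing rule together with a check of constraint (1). It is not taken from the implementation used in the experiments; the function names are chosen only for this example.

```python
from typing import List


def first_fit_decreasing(sizes: List[int], capacity: int) -> List[List[int]]:
    """Textbook FFD: sort items by decreasing size, put each in the first bin where it fits."""
    bins: List[List[int]] = []   # contents of each open bin
    loads: List[int] = []        # current load of each open bin
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError("item larger than bin capacity")
        for k, load in enumerate(loads):
            if load + size <= capacity:
                bins[k].append(size)
                loads[k] += size
                break
        else:
            # No open bin can hold the item, so a new bin is opened.
            bins.append([size])
            loads.append(size)
    return bins


def is_feasible(bins: List[List[int]], capacity: int) -> bool:
    """Constraint (1): the aggregate size in every bin must not exceed the capacity."""
    return all(sum(b) <= capacity for b in bins)


example_bins = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
```

Because the item order and the placement rule are fixed, repeated runs on the same instance return the same packing, which is the property that distinguishes these algorithms from the non-deterministic ones.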

Non-deterministic algorithms generally do not obtain the same solution in
different executions. Non-deterministic approximation algorithms are considered
general-purpose algorithms. We used the following two algorithms: Ant Colony
Optimization (ACO) [8], and Threshold Accepting (TA) [9].

4 Automation of Algorithm Selection 

In this section a new methodology is presented for characterizing algorithm
performance based on past experience. This characterization is used to select
the best algorithm for a new instance of a given problem. Section 4.1 describes
the proposed methodology in a general way. In the following sections, the
phases of the methodology are described and exemplified with the
one-dimensional bin packing problem presented in Section 3.

4.1 Methodology for Characterization and Selection 

The methodology proposed for performance characterization and its application
to algorithm selection consists of three consecutive phases:

Initial Training Phase. The relationship between performance and problem
characteristics is learned from an initial sample of instances solved with
several algorithms.

Prediction Phase. The relationship learned is used to predict the best algorithm
for a new given instance.

Training with Feedback. The newly solved instances are incorporated into the
characterization process to increase the selection quality. The relationship
learned is improved with a new set of solved instances, and it is used again in
the prediction phase.
Fig. 1. Steps of the initial training phase



4.2 Initial Training Phase 

The steps of this phase are shown in Figure 1. In step 1 (Characteristics
Modeling), indicator expressions are derived for measuring the influence of
problem characteristics on algorithm performance. Step 2 (Statistical Sampling)
generates a set of representative instances of an optimization problem. In
step 3 (Characteristics Measurement), the parameter values of each instance are
transformed into characteristic values. In step 4 (Instances Solution), the
instances are solved using a configurable set of heuristic algorithms. Step 5
(Clustering) forms groups of instances that have similar characteristics and
for which one algorithm performed better than the others. Finally, in step 6
(Classification), the identified grouping is learned into formal classifiers,
which are predictors that model the relationship between problem
characteristics and algorithm performance.



Step 1. Characteristics Modeling. Relevant features of the problem
parameters were identified, and expressions to measure the values of the
identified critical characteristics were derived. The characteristics
identified using a common recommendation were: instance size p, item size
dispersion d, and number of factors f. The characteristics identified using
parametric analysis were: constrained capacity t and bin usage b. Once the
critical characteristics were identified, five expressions (2 to 6) to measure
their influence on algorithm performance were derived. The factor analysis
technique was used to confirm whether the derived indicators were critical too.



Instance size p expresses a relationship between the instance size and the
maximum size solved:

$$p = \frac{n}{maxn} \qquad (2)$$

where n = number of items and maxn = the maximum size solved (1000).

Constrained capacity t quantifies the proportion of the bin capacity that is
occupied by an item of average size:

$$t = \frac{\sum_{i} (s_i / c)}{n} \qquad (3)$$

where s_i = size of item i and c = bin capacity.

Item dispersion d expresses the dispersion degree of the item size values. It
is measured using the standard deviation of t:

$$d = \sigma(t) \qquad (4)$$

Number of factors f expresses the proportion of items whose sizes are factors
of the bin capacity. An item is a factor when the bin capacity c is a multiple
of its corresponding item size s_i:

$$f = \frac{\sum_{i} factor(c, s_i)}{n} \qquad (5)$$

Bin usage b expresses the proportion of the total size that can fit in a bin
of capacity c. The inverse of this metric is used to calculate the theoretical
optimum:

$$b = \begin{cases} 1 & \text{if } c \ge \sum_{i} s_i \\ \dfrac{c}{\sum_{i} s_i} & \text{otherwise} \end{cases} \qquad (6)$$
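The five indicators can be computed directly from the item sizes and the bin capacity. The sketch below is one possible reading of expressions (2) to (6), with two assumptions that the text does not settle: the dispersion d is taken as the population standard deviation of the ratios s_i / c, and bin usage b is capped at 1 when all items fit in a single bin. The function name `indicators` is hypothetical.

```python
import statistics
from typing import Dict, List

MAX_N = 1000  # maximum instance size solved, as stated for expression (2)


def indicators(sizes: List[int], capacity: int, max_n: int = MAX_N) -> Dict[str, float]:
    """Compute the characteristic indicators p, t, d, f and b of one bin packing instance."""
    n = len(sizes)
    ratios = [s / capacity for s in sizes]      # share of a bin taken by each item
    total = sum(sizes)
    return {
        "p": n / max_n,                          # instance size, expression (2)
        "t": sum(ratios) / n,                    # constrained capacity, expression (3)
        "d": statistics.pstdev(ratios),          # item dispersion (assumed population std. dev.)
        "f": sum(1 for s in sizes if capacity % s == 0) / n,   # number of factors, (5)
        "b": 1.0 if capacity >= total else capacity / total,   # bin usage, (6)
    }


example = indicators([2, 5, 3, 10], capacity=10)
```

For this toy instance, three of the four sizes divide the capacity, so f is 0.75, and the items would need twice the capacity of one bin, so b is 0.5.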



Step 2. Statistical Sampling. In order to ensure that all problem
characteristics were represented in the instance sample, stratified sampling
and a sample size derived from survey sampling were used. The formation of
strata is a technique that reduces the variability of the results, increases
the representativeness of the sample, and can help ensure consistency,
especially in handling clustered data [10]. Specifically, the following
procedure was used: calculation of the sample size, creation of strata,
calculation of the number of instances for
Table 2. Example of random instances with their characteristics and the best
algorithms

               Characteristics indicators
Instance       Problem   Bin size   Item size   Factors   Item         Real best
               size p    b          t           f         dispersion   algorithms
                                                          d
E1i10.txt      0.078     0.427      0.029       0.000     0.003        FFD, BFD
E50i10.txt     0.556     0.003      0.679       0.048     0.199        FFD, BFD
E147i10.txt    0.900     0.002      0.530       0.000     0.033        TA
E162i10.txt    0.687     0.001      0.730       0.145     0.209        FFD, BFD
E236i10.txt    0.917     0.002      0.709       0.000     0.111        TA




each stratum, and random generation of the instances for each stratum. With 
this method 2,430 random instances were generated. 



Step 3. Characteristics Measurement. For each instance of the sample,
its parameter values were substituted into the indicator expressions to obtain
its critical characteristic values. Table 2 shows the characteristic values
obtained for a small set of instances, which were selected from the sample of
2,430 random instances.



Step 4. Instance Solution. The 2,430 random instances were solved with
the seven heuristic algorithms described in Section 3.2. The performance results
obtained were the theoretical ratio and the execution time. The theoretical
ratio is one of the usual performance metrics for bin packing; it is the ratio
between the obtained solution and the theoretical optimum (a lower bound of the
optimal value, equal to the sum of all the item sizes divided by the bin
capacity). For each sample instance, all algorithms were evaluated in order to
determine a set of best algorithms (see column 7 of Table 2). We defined the
following criteria for algorithm evaluation: the theoretical ratio has the
highest priority, and two values are considered equal if they differ by no more
than 0.0001.
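A small sketch may make the evaluation criteria concrete. It assumes that a lower theoretical ratio is better and that every algorithm whose ratio lies within 0.0001 of the lowest one belongs to the set of best algorithms; execution time, the second performance result, is not modeled here. The function names and the sample ratios are hypothetical.

```python
from typing import Dict, List


def theoretical_ratio(bins_used: int, sizes: List[int], capacity: int) -> float:
    """Ratio between the obtained solution and the lower bound sum(sizes) / capacity."""
    lower_bound = sum(sizes) / capacity
    return bins_used / lower_bound


def best_algorithms(ratios: Dict[str, float], tol: float = 1e-4) -> List[str]:
    """Names of all algorithms whose ratio is within tol of the best (lowest) ratio."""
    best = min(ratios.values())
    return sorted(name for name, r in ratios.items() if r - best <= tol)


ratio_ffd = theoretical_ratio(2, [4, 8, 1, 4, 2, 1], capacity=10)

# Hypothetical ratios for one instance, chosen only to illustrate the tie rule.
sample_ratios = {"FFD": 1.05, "BFD": 1.05004, "TA": 1.08}
sample_best = best_algorithms(sample_ratios)
```

With the tolerance applied, FFD and BFD are treated as tied, which is how one instance can have several real best algorithms in column 7 of Table 2.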



Step 5. Instances Clustering. K-means was used as the clustering method
to create groups of similar instances, each one dominated by an algorithm. The
cluster analysis was carried out using the commercial software SPSS version
11.5 for Windows. The similarity among the members of each group was determined
through the characteristic indicators of the instances and the number assigned
to the algorithm with the best performance for each one. When an instance has
several best algorithms, they need to be tested one by one until k-means places
that instance with other instances that have the same best algorithm. With this
strategy we could transform a set of overlapping groups into a set of disjoint
groups. Four groups were obtained; each group was associated with a set of
similar instances and an algorithm with the best performance for it. Three
algorithms
Fig. 2. Steps of the prediction phase



had poor performance and were outperformed by the other four algorithms. This 
dominance result applies only to the instances space explored in this work. 



Step 6. Classification. In this investigation the C4.5 method was used as
the machine learning technique to find the relationship between the problem
characteristics and algorithm performance. The C4.5 method is a greedy
algorithm that builds a decision tree from the training dataset; the tree is
converted to a set of classification rules, and the rules are ordered by
accuracy and applied in sequence to classify each new instance into the
corresponding group. The percentage of new observations that are correctly
classified is an indicator of the effectiveness of the classification rules. If
these rules are effective on the training sample, they are expected to classify
well new observations whose corresponding group is unknown. The classification
analysis was made using the implementation available in the Weka system [11].
To obtain the classification rules, the five indicators were used as
independent variables, and the number of the best algorithm as the class
variable. The classifier was trained with the 2,430 random bin packing
instances.

4.3 Prediction Phase 

The steps of this phase are shown in Figure 2. For a new instance, step 7
(Characteristics Measurement) calculates its critical characteristic values
using the indicator functions. Step 8 uses the learned classifiers to
determine, from the characteristics of the new instance, which group it belongs
to. The algorithm associated with this group is the expected best algorithm for
the instance.
Table 3. Example of standard instances with their characteristics and the best
algorithms

               Characteristics indicators
Instance       Problem   Bin size   Item size   Factors   Item         Real best
               size p    b          t           f         dispersion   algorithms
                                                          d
Hard0.txt      0.200     0.183      0.272       0.000     0.042        ACO
N3c1w1_t.txt   0.200     0.010      0.499       0.075     0.306        FFD, BFD, MBF
T60_19.txt     0.060     0.050      0.333       0.016     0.075        MBF, MFF
U1000_19.txt   1.000     0.003      0.399       0.054     0.155        ACO
N1w2b2r0.txt   0.050     0.102      0.195       0.020     0.057        MBF, MFF




Table 4. Classification results with 1,369 standard instances

             Best algorithms
Instance     Real         Predicted    Match
1            ACO          ACO          1
90           FFD, BFD     FFD          1
264          ACO          BFD          0
...
1369         ACO          ACO          1

Accuracy                               81%




Step 7. Characteristics Measurement. To validate the effectiveness of the
algorithm performance predictor learned with the C4.5 method, we collected
standard instances that are accepted by the research community. In this step we
used them as new instances. For most of them, the optimal solution is known;
otherwise the best-known solution is available. Four types of standard bin
packing instances were considered. Beasley's OR-Library contains two types of
bin packing problems (u instances and t instances). The Operational Research
Library contains problems of two kinds: N instances and hard instances. All of
these instances are thoroughly described in [12]. Table 3 presents a fraction
of the 1,369 instances collected. For each instance, the indicator functions
were used to determine its critical characteristic values (columns 2 to 6). In
addition, the instances were solved in order to contrast, for each instance,
the selected algorithm against the real best algorithms; column 7 shows the
latter.

Step 8. Selection. The learned classifier was used to predict the best
algorithm for each one of the 1,369 standard instances. Table 4 presents the
validation results of the obtained classifier. For each instance, this table
shows the best algorithms; column 2 is for the real best algorithms and column
3 corresponds to the algorithm selected by prediction. If the predicted
algorithm is one of the real best algorithms, the match is counted. The
classifier predicted the right algorithm for 81% of the standard instances.
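The matching rule of Table 4 can be written in a few lines. In this sketch a prediction counts as a match when it belongs to the set of real best algorithms; the four records used are only the rows visible in Table 4, so the resulting value describes that excerpt and not the full set of 1,369 instances. The function name is hypothetical.

```python
from typing import List, Set, Tuple


def selection_accuracy(real_best: List[Set[str]],
                       predicted: List[str]) -> Tuple[float, List[int]]:
    """Return the accuracy and the per-instance match flags (1 = match, 0 = miss)."""
    matches = [1 if pred in real else 0 for real, pred in zip(real_best, predicted)]
    return sum(matches) / len(matches), matches


# Rows 1, 90, 264 and 1369 of Table 4.
real = [{"ACO"}, {"FFD", "BFD"}, {"ACO"}, {"ACO"}]
pred = ["ACO", "FFD", "BFD", "ACO"]
excerpt_accuracy, excerpt_matches = selection_accuracy(real, pred)
```

Counting a prediction as correct when it hits any member of the real best set is what allows instances with tied algorithms, such as instance 90, to contribute a match.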
4.4 Training with Feedback Phase

The steps of this phase are shown in Figure 3. The objective is to provide
feedback to the selection system and keep it in continuous training. For each
new characterized instance whose best algorithm was selected in the prediction
phase, step 9 (Instance Solution) solves it and obtains the real best
algorithms. After that, step 10 (Patterns Verification) compares the selected
algorithm against the real best algorithms: if the prediction is wrong and the
average accuracy is outside a specified threshold, the classifiers are rebuilt
using the old and new datasets; otherwise the new instance is stored and the
process ends.
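The verification rule of step 10 can be sketched as a small decision function. The sketch assumes that "outside a specified threshold" means that the running average accuracy has fallen below a minimum value; the name `feedback_step` and the way the match history is stored are hypothetical.

```python
from typing import List, Set, Tuple


def feedback_step(history: List[int], predicted: str, real_best: Set[str],
                  min_accuracy: float) -> Tuple[bool, float]:
    """Record one verified prediction and decide whether the classifiers should be rebuilt.

    history holds the 0/1 match flags of earlier predictions and is updated in place.
    """
    match = 1 if predicted in real_best else 0
    history.append(match)
    accuracy = sum(history) / len(history)
    # Rebuild only when this prediction failed AND the average accuracy is too low.
    retrain = match == 0 and accuracy < min_accuracy
    return retrain, accuracy


past = [1, 1, 1, 0]
retrain_now, running_accuracy = feedback_step(past, "BFD", {"ACO"}, min_accuracy=0.7)
```

When `retrain_now` is true, the initial training phase would be run again on the old and new instances together; otherwise the new instance is simply stored.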

5 Conclusions and Future Work 

In this article, we propose a new approach to solving the algorithm selection
problem. The main contribution is a methodology to model algorithm performance
predictors that relate performance to problem characteristics, with the aim of
selecting the best algorithm to solve a specific instance. The relationship is
learned from historical data using machine learning techniques. With this
approach it is possible to incorporate more than one characteristic into the
models of algorithm performance, and to obtain a better problem representation
than with other approaches.

For test purposes 2,430 random instances of the bin packing problem were 
generated. They were solved using seven different algorithms and were used for 










Joaquín Pérez O. et al.



training the algorithm selection system. Afterwards, to validate the system,
1,369 standard instances that have been used by the research community were
collected. In previous work we obtained an accuracy of 76% in the selection
of the best algorithm for all standard instances. The experimental results of
this paper show an improved accuracy of 81% using C4.5 as the classification
method and a refined grouping.

The systematic identification of five characteristics that influence algorithm
performance for the bin packing problem was crucial for the accuracy of the
results. We consider that the principles followed in this research can be
applied to identify critical characteristics of other NP-hard problems.
Abstract. For network traffic analysis and forecasting, a novel method based
on fuzzy association rules is proposed in this paper. Combining fuzzy logic
theory with association rules, the method builds fuzzy association rules and
analyzes the traffic of the whole network using a data mining algorithm.
The method can therefore represent the characteristics of the traffic more
precisely and forecast traffic behavior in advance. The paper first
introduces a new classification method for network traffic. Fuzzy association
rules are then applied to analyze the behavior of existing traffic. Finally,
simulation results are presented which indicate that fuzzy association rules
are very effective in discovering the correlation between different kinds of
traffic in traffic flow analysis.



1 Introduction 

Network traffic is made up of data packets with the same network attributes.
The design of network traffic is an aspect of network engineering, and its
objective is to optimize network capability [1]. By analyzing network traffic,
optimizing its capability, and balancing the traffic load of the links,
routers, and switches, we can make more effective use of the resources of the
whole network.

Network traffic analysis is a representation of the traffic flow design. By
analyzing all behaviors of the traffic, we can find the characteristics of the
traffic and make an objective evaluation of the network capability, which can
be used as the basis for controlling the network traffic. The main current
methods are the queue method [2] [3], the Poisson method [2], the Markovian
method [3], etc. All of these methods are based on simulating the way network
data packets arrive; they are simulation or emulation models of traffic
analysis. Analysis that takes traffic flows as the unit of analysis and uses
data mining methods, however, has only been seen in [4].

There is some correlation between kinds of traffic that are generated
together. How can we find the associations among the various kinds of
simultaneous traffic, study traffic behavior, and predict the future condition
of the network? For these purposes, we combine fuzzy logic theory with
association rules and propose a novel method based on fuzzy association rules.

There are several advantages to analyzing network traffic with fuzzy
association rules. First, the volume of traffic is high and the number of
traffic records is large, so there is a mass of data to analyze. Second, there
are association relations in the traffic; for example, one kind of traffic
often occurs simultaneously with another. Third, we can evaluate the capability
of the network with fuzzy association rules from a macro point of view, because
the unit of analysis is not a data packet but the volume of network jobs.
Finally, using a fuzzy method, we take the traffic of different time intervals
as a record, which avoids the problem of a recording granularity that is too
small.

The paper is organized as follows. Section 2 introduces the classification of 
network traffic and transaction. Section 3 introduces the fuzzy association rules and 
our algorithm, and an example using our algorithm is explained in Section 4. The 
conclusion is drawn in Section 5. 



2 Classification of Network Traffic and Transaction 

Fuzzy association rules search for relationships in a transactional database.
Network traffic flow, however, cannot be described directly as a transactional
database. When describing network behavior, we can use the traffic flow, the
measuring time zone, and the measuring points to reflect the different
association relations of the traffic at different measuring points and in
different measuring time zones.

Searching for association relationships requires a large amount of data, and
this data must be saved in a transactional database. Every record is a
transaction. A transactional database usually consists of a transaction ID and
an item set, and here the items are traffic items. When establishing a
transactional database, we should first consider the form of a transaction
item. This means that we should classify the network traffic and establish
records that are made of traffic.

2.1 Classification of Network Traffic 

The main factors in measuring and analyzing network traffic are the source IP
address, the destination IP address, the source port, the destination port, the
protocol type, the begin time, and the end time. In the fuzzy association rule
analysis of network traffic, the traffic is classified mainly according to the
protocol type and the begin time.

In addition, a time partition Δt must be defined for the traffic. Usually,
analysts can define it according to their needs. While data packets are being
transmitted and the time interval between adjacent data packets is less than
Δt, the traffic flow is said to be active. If the time interval between
adjacent data packets is greater than or equal to Δt, the former traffic has
terminated and the next traffic starts.
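The rule for delimiting traffic with the time partition Δt can be sketched as follows. The sketch assumes that the packets passed in already share the same attributes (for example, the same protocol and subnet) and that only their arrival times matter; the function name `split_into_flows` is hypothetical.

```python
from typing import List


def split_into_flows(timestamps: List[float], delta_t: float) -> List[List[float]]:
    """Group packet arrival times into flows.

    A gap shorter than delta_t keeps the current flow active; a gap greater than
    or equal to delta_t ends the current flow and starts the next one.
    """
    flows: List[List[float]] = []
    for ts in sorted(timestamps):
        if flows and ts - flows[-1][-1] < delta_t:
            flows[-1].append(ts)
        else:
            flows.append([ts])
    return flows


example_flows = split_into_flows([0, 4, 9, 30, 32, 70], delta_t=10)
```

Each resulting flow has a begin time (its first packet) and an end time (its last packet), which are two of the factors listed above.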

According to these factors, we can classify the traffic by the following factors:

- Protocol type 

- Source and destination IP address 

- Communication port 

- Interval time 

- Flow grade or duration 
First, the protocol type is the basis of traffic classification. Different
protocol types denote different traffic.

On the Internet, every interface has an IP address. If different traffic were
divided according to the source and destination IP addresses, with every IP
address classified as one category, the classification granularity would be too
fine. This would make mining more difficult, and the mining result would not be
useful. We should instead use the subnet address to classify the different
traffic. When the source and destination subnet addresses are the same, we
consider the IP addresses of the two traffic flows to be the same. There are
three kinds of traffic according to the network condition. First, the source
and the destination IP addresses are both the same; we regard this as the same
traffic. Second, only the sources are the same. Third, only the destinations
are the same. Generally, the second and the third are the main cases. In the
following section, we only analyze the kind of traffic with the same source IP
addresses.

The communication port is also a factor in network traffic analysis. If the
ports are different, the traffic is different, so the number of traffic types
grows with the square of the number of ports.

The interval time is a necessary factor in network traffic classification. We
classify data packets into different traffic using the time interval Δt.

If the division were made according to duration, the same kind of traffic
would be split into different traffic merely because the network bandwidth
differs, so this division method is useless. Currently, we divide the traffic
into different grades according to flow. The three grades are: little, middle,
large.

We take all above methods as consideration. When the protocol is the same, it will 
bring 10*^ traffic types. It will supply plenty of mining data. If, however, we take all 
the subnets into the analysis in practice, the result would be useless. So we
restrict the observation scope and mine the fuzzy association rules from one
subnet to another. Moreover, because a fixed port usually identifies a network
protocol, the port is redundant as a separate factor. We therefore prescribe
that two traffic flows are of the same type if their protocols, their subnet
addresses, and their flow grades are all the same.

2.2 Classification of the Network Transaction 

The transactional database consists of individual transactions, and each transaction is
made up of several transactional items. The division of network traffic into transactions
is an artificial classification made within one observation.

First, we use the traffic classification method above to divide all packets into single
traffics by the time interval Δt.

In order to mine the fuzzy association rules, we divide the observation time into time
slices that are very small, contiguous, and of equal size. A traffic begins in one time
slice and may end in another. All traffics that begin in the same time slice are placed
in one traffic item set.

The beginning time of a traffic item set is the beginning time of its time slice, and its
ending time is the ending time of the last traffic in the traffic item set.
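
The grouping of traffics into item sets by time slice can be sketched as follows. This is
only an illustrative sketch: the function name `build_item_sets`, the tuple layout
`(name, start, end)`, the sample records and the slice length of 400 s are assumptions
made for the example, not part of the method described above.

```python
from collections import defaultdict

def build_item_sets(traffic, slice_len):
    """Group traffics into item sets keyed by the time slice in which they begin.

    traffic: iterable of (name, start, end) tuples, times in seconds (assumed layout).
    Returns {slice_index: (begin_time, end_time, [names])}.
    """
    groups = defaultdict(list)
    for name, start, end in traffic:
        groups[int(start // slice_len)].append((name, start, end))

    item_sets = {}
    for idx, members in sorted(groups.items()):
        begin = idx * slice_len                     # beginning of the time slice
        finish = max(end for _, _, end in members)  # end of the last traffic in the set
        item_sets[idx] = (begin, finish, [name for name, _, _ in members])
    return item_sets

# Hypothetical records: (name, start, end)
sample = [("HTTP2", 0, 65), ("FTP1", 10, 50), ("HTTP1", 300, 500), ("HTTP3", 800, 1200)]
item_sets = build_item_sets(sample, slice_len=400)
for idx, (begin, finish, names) in item_sets.items():
    print(idx, begin, finish, names)
```

With these sample records, the three traffics that begin before 400 s fall into the first
item set, whose ending time is that of the traffic that finishes last.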

According to the above analysis, a shopping basket database can be compared with the
traffic database, as shown in Table 1.

Table 1. A shopping database and the traffic database

             Shopping database                        Traffic database
Record       The set of merchandise in one person's   The traffic that happened in the
             basket on one shopping trip              same time slice
Item field   Merchandise set                          Traffic set
Item         Merchandise                              Traffic
Timezone     Shopping time                            Transport time of the traffic item set



3 Fuzzy Association Rules and Related Calculation Ways 
of Network Traffic 

3.1 Definition of Fuzzy Association Rules

The fuzzy association rules are defined as follows [5]. Let

$$T = \big(Tid, (a_1, v_1), (a_2, v_2), \ldots, (a_n, v_n)\big)$$

be a transactional database, where $a_i$ is an attribute of a transaction and $v_i$ is the
value of $a_i$. Each value $v_i$ is associated with a membership degree $\mu_i$. The minimum
membership threshold is denoted by $\varepsilon$. The smaller the membership degree is, the
less the value contributes to the support of $Tid$; the bigger it is, the more it
contributes.

Support degree is:

$$\mathrm{sup}(A \Rightarrow B) = \mathrm{sup}(A \wedge B) = \frac{1}{n} \sum_{d \in D} \big(\mu_{A \wedge B}(d) \mid \mu_{A \wedge B}(d) > \varepsilon\big) \qquad (1)$$

Confidence degree is:

$$\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{sup}(A \wedge B)}{\mathrm{sup}(A)} = \frac{\sum_{d \in D} \big(\mu_{A \wedge B}(d) \mid \mu_{A \wedge B}(d) > \varepsilon\big)}{\sum_{d \in D} \big(\mu_{A}(d) \mid \mu_{A}(d) > \varepsilon\big)} \qquad (2)$$

Therein, $\mu_{A \wedge B}(d) = \bigwedge \{\mu_{x_i}(d) \mid x_i \in A \cup B\}$ is the contribution of the
record $d$ to $A \Rightarrow B$, $\wedge$ denotes taking the least value, and $n$ denotes the total
number of transactions in a transaction set.
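
A minimal sketch of these two measures is given below. It assumes that the membership of
an item set in a record is the minimum of the item memberships, that only memberships
above the threshold are summed, and that confidence is the ratio of the support of the
whole rule to the support of its antecedent (the form used in the worked example of
Section 4.5). The function names and the toy records are hypothetical.

```python
def itemset_membership(record, items):
    """Membership of an item set in one record: the least membership among its items."""
    return min(record[item] for item in items)

def support(db, items, eps=0.0, normalise=True):
    """Sum of item-set memberships above eps, optionally divided by the record count n."""
    total = sum(m for m in (itemset_membership(r, items) for r in db) if m > eps)
    return total / len(db) if normalise else total

def confidence(db, antecedent, consequent, eps=0.0):
    """Support of antecedent and consequent together, relative to the antecedent alone."""
    s_a = support(db, antecedent, eps, normalise=False)
    s_ab = support(db, list(antecedent) + list(consequent), eps, normalise=False)
    return s_ab / s_a if s_a else 0.0

# Toy database: each record maps an item to its membership degree (illustrative values).
db = [
    {"A": 1.0, "B": 0.5},
    {"A": 0.8, "B": 0.9},
    {"A": 0.0, "B": 1.0},
    {"A": 0.4, "B": 0.2},
]
print(support(db, ["A", "B"]))
print(confidence(db, ["A"], ["B"]))
```

Normalising by n only rescales the support, so it cancels in the confidence ratio.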
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3.2 Algorithm of Fuzzy Association Rules of Traffic Analysis 

As with ordinary association rules [6], the algorithm for fuzzy association rules is
divided into two parts: first, find all item sets that satisfy the minimum support degree
(the frequent item sets); second, derive the rules from the frequent item sets. For
traffic analysis, the steps of fuzzy association rule mining follow the method of [7].
The membership functions are constructed as follows.

First, we divide the traffic flow into three fuzzy regions (little, middle, and large)
with the k-middle-point algorithm [8]. Second, we calculate the three middle values.
Then we construct the fuzzy sets and membership functions. From the definition in [9],
we get the following.

Let the clustering middle point of fuzzy set $F_i$ be $r_i$. For $i = 1, 2, \ldots, k-1$, the
large borderline $B_i$ of $F_i$ and the small borderline $b_{i+1}$ of $F_{i+1}$ are

$$B_i = r_i + 0.5(1 + p)(r_{i+1} - r_i) \qquad (3)$$

$$b_{i+1} = r_{i+1} - 0.5(1 + p)(r_{i+1} - r_i) \qquad (4)$$

The region of a quantitative traffic attribute is $[L, R]$, and $M = \{m_1, m_2, \ldots, m_k\}$ is
the set of clustering middle points. We construct grads fuzzy membership functions. The
overlap of the boundaries of two adjacent fuzzy sets is $p$. The membership functions are
shown in Fig. 1.

According to the grads membership function construction method [9] [10], we
obtain three membership functions.



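1. The membership function of the fuzzy set "little" (clustering middle point $m_1$) is:

$$\mu_{little}(u) = \begin{cases} 1, & L \le u \le b_2 \\ \dfrac{B_1 - u}{B_1 - b_2}, & b_2 < u < B_1 \\ 0, & B_1 \le u \end{cases} \qquad (5)$$

2. The membership function of the fuzzy set "middle" (clustering middle point $m_2$) is:

$$\mu_{middle}(u) = \begin{cases} 0, & L \le u \le b_2, \; B_2 \le u \le R \\ \dfrac{u - b_2}{B_1 - b_2}, & b_2 < u < B_1 \\ \dfrac{B_2 - u}{B_2 - b_3}, & b_3 < u < B_2 \\ 1, & B_1 \le u \le b_3 \end{cases} \qquad (6)$$

3. The membership function of the fuzzy set "large" (clustering middle point $m_3$) is:

$$\mu_{large}(u) = \begin{cases} 0, & L \le u \le b_3 \\ \dfrac{u - b_3}{B_2 - b_3}, & b_3 < u < B_2 \\ 1, & B_2 \le u \le R \end{cases} \qquad (7)$$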
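
The three membership functions can be sketched in code as piecewise-linear functions of
the flow $u$. The sketch below assumes the trapezoidal shape of Fig. 1 with linear ramps on
the overlaps $[b_2, B_1]$ and $[b_3, B_2]$; the helper name `make_memberships` is hypothetical,
and the boundary values are those quoted in the example of Section 4.

```python
def make_memberships(b2, B1, b3, B2):
    """Return (little, middle, large) membership functions for the given boundaries."""

    def little(u):
        if u <= b2:
            return 1.0
        if u >= B1:
            return 0.0
        return (B1 - u) / (B1 - b2)      # falling ramp on [b2, B1]

    def middle(u):
        if u <= b2 or u >= B2:
            return 0.0
        if u < B1:
            return (u - b2) / (B1 - b2)  # rising ramp on [b2, B1]
        if u <= b3:
            return 1.0
        return (B2 - u) / (B2 - b3)      # falling ramp on [b3, B2]

    def large(u):
        if u <= b3:
            return 0.0
        if u >= B2:
            return 1.0
        return (u - b3) / (B2 - b3)      # rising ramp on [b3, B2]

    return little, middle, large

# Boundaries taken from the HTTP example in Section 4.
little, middle, large = make_memberships(b2=59.225, B1=72.275, b3=124.075, B2=155.425)
for flow in (35, 65, 80, 140, 300):
    print(flow, round(little(flow), 3), round(middle(flow), 3), round(large(flow), 3))
```

Inside each overlap the two adjacent memberships sum to 1, so a flow is shared between
neighbouring fuzzy sets rather than assigned to only one of them.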
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Fig. 1. Grads fuzzy membership function 



4 An Example 

Suppose we only consider the communication of three subnets on campus and mine
their fuzzy association rules.

The three subnets are 202.117.49.1, 202.117.48.1, and 202.117.21.1. The protocol
types are HTTP, FTP, and TELNET. The traffic is formed with the method described in
Section 2.2. The observation time is one hour. One time slice is 400 ms.



The first step: Classification of the Traffic 

Take the HTTP protocol as an example. We regard the three subnets as its source IP
addresses, so HTTP is divided into three types: HTTP1, HTTP2, and HTTP3.

According to the flow of the protocol, each type is divided into three classes: little,
middle, and large. HTTP1 is divided into HTTP1little, HTTP1middle, and HTTP1large. HTTP2
and HTTP3 are treated in the same way, so HTTP is divided into 9 kinds of traffic
(see Table 2). Together, HTTP, FTP, and TELNET are divided into 27 kinds of traffic.
This gives a large number of traffic items for data mining.

Table 2. HTTP traffic classification 



Subnet          Protocol classification   Traffic classification
202.117.49.1    HTTP1                     HTTP1little, HTTP1middle, HTTP1large
202.117.48.1    HTTP2                     HTTP2little, HTTP2middle, HTTP2large
202.117.21.1    HTTP3                     HTTP3little, HTTP3middle, HTTP3large



The second step: The Classification of the Transaction 

One transaction corresponds to one observation by the analyst. First we merge the time
zones, and then fill the traffic database with the observed traffics and their flows.

In one observation, we obtain different traffics. We record the time zone, traffic,
and flow in a table (Table 3 shows part of the primitive traffic flow table).

Table 3. Primitive parts traffic flow table

Timezone (s)    Traffic    Flow (Mb)
[0, 65]         HTTP2      300
[10, 50]        FTP1       500
[20, 620]       FTP3       2000
[15, 500]       TELNET2    40
[50, 200]       TELNET3    30
[80, 3000]      FTP2       1500
[300, 500]      HTTP1      80
[800, 1200]     HTTP3      120
[2400, 3600]    FTP2       2000
[1450, 1680]    FTP3       700



Divide each traffic into traffic items; see Table 4.

Table 4. Traffic flow table

TID  Timezone (s)   HTTP1  HTTP2  HTTP3  FTP1  FTP2  FTP3  TELNET1  TELNET2  TELNET3
T1   [0,3000]       80     300    50     500   1500  200   50       40       30
T2   [400,680]      120    56     60     650   780   20    8        9        20
T3   [800,1200]     80     200    120    54    75    89    0        15       6
T4   [1200,1680]    100    30     45     65    120   700   50       15       30
T5   [1600,2800]    70     80     90     780   625   205   20       150      60
T6   [2000,2320]    35     300    40     41    60    1500  85       35       4
T7   [2400,2950]    60     50     90     0     2000  45    5        125      35
T8   [2800,3000]    85     35     16     0     0     58    45       86       130
T9   [3200,3600]    46     73     95     150   75    0     20       75       0

We analyze all the HTTP traffic and partition it into three fuzzy sets. The left boundary
of the flow region is L = 25 and the right boundary is R = 300. The three clustering
centres are m1 = 44, m2 = 87.5, and m3 = 192. The large boundary of region 1 is
B1 = 72.275; the small boundary of region 2 is b2 = 59.225 and its large boundary is
B2 = 155.425; the small boundary of region 3 is b3 = 124.075. From these we obtain the
membership degree of every flow in each fuzzy set; see Table 5.
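
The quoted boundaries can be reproduced from the three clustering centres with the
borderline formulas of Section 3.2. In the sketch below, the overlap value p = 0.3 is an
assumption: it is not stated in the text, but it is the value that yields the four
boundaries given above. The function name `boundaries` is hypothetical.

```python
def boundaries(centres, p):
    """Large borderline B_i and small borderline b_(i+1) for each pair of adjacent centres."""
    half = 0.5 * (1 + p)
    large_b, small_b = [], []
    for r_i, r_next in zip(centres, centres[1:]):
        gap = r_next - r_i
        large_b.append(r_i + half * gap)     # B_i
        small_b.append(r_next - half * gap)  # b_(i+1)
    return large_b, small_b

large_b, small_b = boundaries([44, 87.5, 192], p=0.3)
print([round(x, 3) for x in large_b])  # B1, B2
print([round(x, 3) for x in small_b])  # b2, b3
```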

Table 5. The membership degree of all traffic (the HTTP part of the table)

                       HTTP1                    HTTP2                    HTTP3
TID  Timezone (s)   little  middle  large   little  middle  large   little  middle  large
T1   [0,3000]       0       1       0       0       0       1       1       0       0
T2   [400,680]      0       1       0       1       0       0       0.92    0.08    0
T3   [800,1200]     0       1       0       0       0       1       0       1       0
T4   [1200,1680]    0       1       0       1       0       0       1       0       0
T5   [1600,2800]    0.15    0.85    0       0       1       0       0       1       0
T6   [2000,2320]    1       0       0       0       0       1       1       0       0
T7   [2400,2950]    0.92    0.08    0       1       0       0       0       1       0
T8   [2800,3000]    0       1       0       1       0       0       1       0       0
T9   [3200,3600]    1       0       0       0       1       0       0       1       0
Cardinality         3.07    5.93    0       4       2       3       4.92    4.08    0

4.1 Calculate Frequent 1-Item Sets

The minimum support count is min_sup = 4; the frequent 1-item sets are listed in Table 6.



Table 6. Frequent 1-item sets

Sequence   Item set        Support
1          HTTP1middle     5.93
2          HTTP2little     4
3          HTTP3little     4.92
4          HTTP3middle     4.08
5          FTP1little      6
6          FTP2little      4
7          FTP3little      7
8          TELNET1little   5.73
9          TELNET2little   5
10         TELNET3little   7
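
The selection of the frequent 1-item sets amounts to comparing each cardinality with
min_sup. The sketch below does this for the HTTP items only, using the cardinalities of
Table 5; the dictionary layout is an assumption made for illustration.

```python
# Cardinalities of the HTTP items, copied from the last row of Table 5.
cardinality = {
    "HTTP1little": 3.07, "HTTP1middle": 5.93, "HTTP1large": 0,
    "HTTP2little": 4,    "HTTP2middle": 2,    "HTTP2large": 3,
    "HTTP3little": 4.92, "HTTP3middle": 4.08, "HTTP3large": 0,
}
min_sup = 4

# An item is frequent when its cardinality reaches the minimum support count.
frequent_1 = {item: c for item, c in cardinality.items() if c >= min_sup}
print(sorted(frequent_1))
```

The four items that survive are the first four rows of Table 6.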



4.2 Calculate Frequent 2-Item Sets

(1) Calculate the candidate 2-item sets. We take the membership degree of
(HTTP1middle, HTTP2little) as an example; the other item sets are handled in the same
way. The membership degree of (HTTP1middle, HTTP2little) in each transaction equals the
minimum of the membership degree of HTTP1middle and the membership degree of
HTTP2little; see Table 7.

Table 7. Membership degree of (HTTP1middle, HTTP2little)

TID           HTTP1middle   HTTP2little   Membership
T1            1             0             0
T2            1             1             1
T3            1             0             0
T4            1             1             1
T5            0.85          0             0
T6            0             0             0
T7            0.08          1             0.08
T8            1             1             1
T9            0             0             0
Cardinality                               3.08
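
The cardinality in Table 7 can be checked with a few lines of code. The sketch copies the
two membership columns from Table 7 and takes the row-wise minimum; the variable names
are illustrative only.

```python
# Membership columns for T1..T9, copied from Table 7.
http1_middle = [1, 1, 1, 1, 0.85, 0, 0.08, 1, 0]
http2_little = [0, 1, 0, 1, 0, 0, 1, 1, 0]

# Membership of the 2-item set in each transaction is the smaller of the two degrees.
joint = [min(a, b) for a, b in zip(http1_middle, http2_little)]
cardinality_2 = sum(joint)

print(joint)
print(round(cardinality_2, 2))
```

Because the resulting cardinality is below min_sup = 4, this candidate is not a frequent
2-item set.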



(2) Get the frequent 2-item sets. See Table 8.

Table 8. Frequent 2-item sets

Sequence   Frequent 2-item set              Support
1          (HTTP1middle, FTP3little)        4.93
2          (HTTP1middle, TELNET2little)     4
3          (HTTP1middle, TELNET3little)     4.08
4          (HTTP3middle, FTP3little)        4.08
5          (HTTP3middle, TELNET1little)     4.08
6          (FTP1little, FTP2little)         4
7          (FTP1little, FTP3little)         4
8          (FTP1little, TELNET3little)      5
9          (FTP3little, TELNET1little)      5.6
10         (FTP3little, TELNET3little)      5
11         (TELNET1little, TELNET3little)   4.26
12         (TELNET2little, TELNET3little)   5



4.3 Calculate Frequent 3-Item Sets

Using the same method as in Section 4.2, we get the frequent 3-item sets; see Table 9.

Table 9. Frequent 3-item sets

Sequence   Frequent 3-item set                            Support
1          (HTTP1middle, TELNET2little, TELNET3little)    4
2          (HTTP3middle, FTP3little, TELNET1little)       4.08
3          (FTP3little, TELNET1little, TELNET3little)     4.13



4.4 Calculate Frequent 4-Item Sets

The membership degree of (HTTP3middle, FTP3little, TELNET1little, TELNET3little) is
given in Table 10.

Table 10. The membership degree of (HTTP3middle, FTP3little, TELNET1little,
TELNET3little)

TID           HTTP3middle   FTP3little   TELNET1little   TELNET3little   Membership
T1            0             1            0.13            1               0
T2            0.08          1            1               1               0.08
T3            1             1            1               1               1
T4            0             0            0.13            1               0
T5            1             1            1               0               0
T6            0             0            0               1               0
T7            1             1            1               1               1
T8            0             1            0.47            0               0
T9            1             1            1               1               1
Cardinality                                                              3.08



The support of the candidate 4-item set is 3.08, which does not satisfy the minimum
support min_sup >= 4, so there is no frequent 4-item set. The candidate fuzzy
association rules are listed in Table 11.

Table 11. The candidate fuzzy association rules

Sequence   The candidate fuzzy association rule
1          HTTP1middle ∧ TELNET2little ⇒ TELNET3little
2          HTTP1middle ∧ TELNET3little ⇒ TELNET2little
3          TELNET2little ∧ TELNET3little ⇒ HTTP1middle
4          HTTP3middle ∧ FTP3little ⇒ TELNET1little
5          HTTP3middle ∧ TELNET1little ⇒ FTP3little
6          TELNET1little ∧ FTP3little ⇒ HTTP3middle
7          FTP3little ∧ TELNET1little ⇒ TELNET3little
8          FTP3little ∧ TELNET3little ⇒ TELNET1little
9          TELNET3little ∧ TELNET1little ⇒ FTP3little



4.5 Calculate the Confidence

(1) For the rule "HTTP1 is middle and TELNET2 is little, then TELNET3 is little", the
confidence is:

$$\frac{\text{HTTP1middle} \cap \text{TELNET2little} \cap \text{TELNET3little}}{\text{HTTP1middle} \cap \text{TELNET2little}} = \frac{4}{4} = 100\%$$

(2) For the rule "HTTP3 is middle and FTP3 is little, then TELNET1 is little", the
confidence is:

$$\frac{\text{HTTP3middle} \cap \text{FTP3little} \cap \text{TELNET1little}}{\text{HTTP3middle} \cap \text{FTP3little}} = \frac{4.08}{4.08} = 100\%$$

(3) For the rule "FTP3 is little and TELNET1 is little, then TELNET3 is little", the
confidence is:

$$\frac{\text{FTP3little} \cap \text{TELNET1little} \cap \text{TELNET3little}}{\text{FTP3little} \cap \text{TELNET1little}} = \frac{4.13}{5.6} = 73.75\%$$

The other confidences are:

$$\frac{4}{4.08} = 98\%, \quad \frac{4.08}{4.08} = 100\%, \quad \frac{4.13}{5} = 82.6\%, \quad \frac{4}{5} = 80\%, \quad \frac{4.08}{5.6} = 72.86\%, \quad \frac{4.13}{4.26} = 96.93\%$$

If the minimum confidence is min_conf = 60%, all nine rules are satisfied.
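
Several of these confidences can be recomputed directly from the membership columns of
Table 10. The sketch below assumes that the support of an item set is the sum of its
row-wise minimum memberships and that the confidence of a rule is the support of all its
items divided by the support of its antecedent; the helper names are hypothetical.

```python
# Membership columns for T1..T9, copied from Table 10.
columns = {
    "HTTP3middle":   [0, 0.08, 1, 0, 1, 0, 1, 0, 1],
    "FTP3little":    [1, 1, 1, 0, 1, 0, 1, 1, 1],
    "TELNET1little": [0.13, 1, 1, 0.13, 1, 0, 1, 0.47, 1],
    "TELNET3little": [1, 1, 1, 1, 0, 1, 1, 0, 1],
}

def support(items):
    """Sum over transactions of the least membership among the given items."""
    return sum(min(row) for row in zip(*(columns[i] for i in items)))

def confidence(antecedent, consequent):
    """Support of the whole rule relative to the support of its antecedent."""
    return support(antecedent + consequent) / support(antecedent)

rules = [
    (["HTTP3middle", "FTP3little"], ["TELNET1little"]),
    (["FTP3little", "TELNET1little"], ["TELNET3little"]),
    (["FTP3little", "TELNET3little"], ["TELNET1little"]),
]
for ante, cons in rules:
    print(" ∧ ".join(ante), "⇒", cons[0], f"{confidence(ante, cons):.2%}")
```

Every rule whose confidence is at least min_conf is kept, which is why all of the
candidate rules pass at the 60% threshold.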



5 Conclusion 

This paper discusses fuzzy data mining and its application to the analysis of network
traffic. The emphases of this paper are the mining algorithm for fuzzy association
rules and how to analyze network traffic flow using this algorithm. The
conclusions of this paper are as follows.

First, the unit of analysis is the traffic, which gives the research a broader scope. By
mining the fuzzy association rules of the traffic, we can predict the future condition of
the network.

Second, fuzzy theory is often used to describe objective things. Based on fuzzy
association rules, we propose the concept of fuzzy association rule mining for network
traffic, which extends fuzzy association rules and their field of application. In fuzzy
association rule mining, the support that a single transaction gives to a fuzzy pattern
is itself fuzzy, and the support of a fuzzy pattern is the total support over a
transaction set.

Finally, research that uses fuzzy association rules to discover network behavior is
still scarce. This paper verifies the feasibility of the method through the analysis of
an example.
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Abstract. Feature selection plays a central role in data analysis and is also a 
crucial step in machine learning, data mining and pattern recognition. Feature 
selection algorithm focuses mainly on the design of a criterion function and the 
selection of a search strategy. In this paper, a novel feature selection approach 
(NFSA) based on quantum genetic algorithm (QGA) and a good evaluation 
criterion is proposed to select the optimal feature subset from a large number of 
features extracted from radar emitter signals (RESs). The criterion function is 
given firstly. Then, detailed algorithm of QGA is described and its 
performances are analyzed. Finally, the best feature subset is selected from the 
original feature set (OFS) composed of 16 features of RESs. Experimental 
results show that the proposed approach reduces greatly the dimensions of OFS 
and heightens accurate recognition rate of RESs, which indicates that NFSA is 
feasible and effective. 



1 Introduction 

Feature selection is the process of extracting the most discriminatory information and
removing the irrelevant and redundant information from a large number of
measurable attributes. [1] Good features can enhance within-class pattern similarity
and between-class pattern dissimilarity. [2] A minimal set of relevant and significant
features simplifies the design of classifiers without degrading the performance of the
algorithms devoted to feature extraction and classification. So feature selection plays
a central role in data analysis and is a crucial step in pattern recognition, machine
learning and data mining. [1-5]

Feature selection algorithms presented in the literature can be classified into two
categories based on whether or not feature selection is performed independently of
the learning algorithm used to construct the classifier. Feature selection algorithms
that operate independently of the performance of a specific learning algorithm are
referred to as the filter selection approach. Conversely, feature selection algorithms
directly related to the performance of the learning algorithm are regarded as the
wrapper selection approach. [4]

Radar emitter signal recognition is a typical pattern recognition problem. A large number
of features extracted from radar emitter pulse signals with different intra-pulse
modulation laws not only consumes much time, but also introduces useless information
that interferes with the useful features, because feature extraction is always a
subjective process. Feature subset selection in radar emitter signal recognition can be
considered as a global combinatorial optimization problem. [5] It is difficult to select
the optimal m features from the $C_n^m$ possible combinations of the n features. The
genetic algorithm (GA) is a good search technique for combinatorial optimization
problems that has developed rapidly in recent years. But conventional genetic algorithms
(CGA) often suffer from slow convergence and premature convergence in applications, and
have a weak capability of balancing exploration and exploitation; that is to say,
population diversity and selective pressure are not easy to achieve simultaneously.
[6-7] Based on the principles of quantum computing [8-9], the genetic quantum algorithm
(GQA) [10] was presented to solve combinatorial optimization problems, and the results
demonstrate that GQA is greatly superior to CGA. However, GQA has several shortcomings,
such as the non-determinability of the look-up table for updating quantum gates, the
need for prior knowledge of the best solution, and premature convergence.

In this paper, a novel feature selection approach (NFSA) based on a quantum genetic
algorithm and a good evaluation criterion is proposed. Because the main parts of a
feature selection method are the evaluation criterion of the optimal feature subset and
the automatic search algorithm, a valid evaluation criterion for selecting the optimal
feature subset from the original feature set is proposed first. Then, a novel quantum
genetic algorithm (NQGA) is presented based on the concepts and theories of
quantum computing. In NQGA, a novel update strategy for the rotation angles of quantum
gates, an immigration operation and a catastrophe operation are introduced to enhance
search capability and to avoid premature convergence. The performance of NQGA is
compared with that of GQA. After neural network classifiers are designed, the optimal
feature subset is selected using NFSA from the original feature set composed of 16
features of radar emitter signals. Experimental results show that the proposed
feature selection approach greatly reduces the dimension of the original feature set and
raises the accurate recognition rate of radar emitter signals, which indicates that the
introduced approach is feasible and effective.

This paper is organized as follows. Section 2 gives the criterion function for 
evaluating the best feature subset. Section 3 describes the algorithm of NQGA in 
detail. Neural network classifiers are designed and the simulation experiments of 
feature selection and radar emitter signal recognition are made and experimental 
results are analyzed in section 4. Concluding remarks are listed in Section 5. 

2 Evaluation Criterion 

To facilitate the selection process, the quality of any feature has to be assessed via
a well-designed criterion function. It is more important to design the feature
selection criterion for a set of features, because considering features individually
does not reveal the redundancy in the input data.

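Suppose that the maximum within-class clustering of the $i$-th class is represented
with $C_{ii}$. We define $C_{ii}$ as

$$C_{ii} = \max_{q} \left\{ \left[ \frac{1}{M_i^q} \sum_{k=1}^{M_i^q} \left| x_{ik}^q - E(X_i^q) \right|^p \right]^{1/p} \right\} \qquad (1)$$

where $q = 1, 2, \ldots, N$, $N$ is the number of features, $M_i^q$ is the number of samples
of the $q$-th feature of the $i$-th class, $x_{ik}^q$ is the $k$-th sample value of the $q$-th
feature of the $i$-th class, $X_i^q = [x_{i1}^q, x_{i2}^q, \ldots, x_{iM_i^q}^q]$, $E(X_i^q)$ is the
expectation of $X_i^q$, and $p$ ($p \ge 1$) is an integer. Similarly, the maximum within-class
clustering $C_{jj}$ of the $j$-th class is

$$C_{jj} = \max_{q} \left\{ \left[ \frac{1}{M_j^q} \sum_{k=1}^{M_j^q} \left| x_{jk}^q - E(X_j^q) \right|^p \right]^{1/p} \right\} \qquad (2)$$

where $q$, $p$ and $N$ are the same as in Eq. (1), $M_j^q$ is the number of samples of the
$q$-th feature of the $j$-th class, $x_{jk}^q$ is the $k$-th sample value of the $q$-th feature of
the $j$-th class, $X_j^q = [x_{j1}^q, x_{j2}^q, \ldots, x_{jM_j^q}^q]$, and $E(X_j^q)$ is the expectation
of $X_j^q$.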



The minimum distance $D_{ij}$ between the $i$-th class and the $j$-th class is

$$D_{ij} = \min_{q} \left\{ \left\| E(X_i^q) - E(X_j^q) \right\| \right\} \qquad (3)$$



Thus, the between-class separability $S_{ij}$ between the $i$-th class and the $j$-th class
is defined as

$$S_{ij} = \cdots \qquad (4)$$

Assume that there are in total $H$ classes to be recognized. The criterion function of
NFSA is represented with

$$f = \frac{2}{H(H-1)} \sum_{i=1}^{H-1} \sum_{j=i+1}^{H} S_{ij} \qquad (5)$$

Obviously, the bigger $f$ is, the better the selected feature subset is.



3 Search Strategy 

In CGA, crossover and mutation operations are used to maintain the diversity of the
population, while in GQA this is done by the evolutionary operation in which quantum
gates operate on the probability amplitudes of basic quantum states. So the technique of
updating quantum gates is a key problem of GQA. In reference [10], quantum logic gates
are updated by comparing the binary bits, fitness and probability amplitudes of the
current solution with those of the best solution in the last generation. The look-up
table in GQA is very complicated and the values of the angles of quantum gates are
difficult to decide. It was pointed out in reference [10] that the method of updating
quantum gates was suitable for optimization problems such as the knapsack problem. The
reason is that the update strategy is based on knowing the criterion of the optimal
solution beforehand. For example, in the knapsack problem, the criterion of the optimal
solution is that the number of “1” bits should be as large as possible within the
constraint conditions, because more “1” bits mean a bigger fitness of the chromosome.
However, the criteria of the optimal solutions cannot be known in many other
optimization problems and in practical applications. What's more, premature
convergence appears easily in GQA because all solutions have the same evolution
direction. So, a novel quantum genetic algorithm is proposed to overcome these
shortcomings.

3.1 Chromosome Representation 

The quantum bit (qubit) chromosome representation has the good characteristic of
representing any linear superposition of solutions: a qubit may be in the '1' state,
in the '0' state, or in any superposition of the two, which is not possible in binary,
numeric, or symbol representation. So we also adopt this representation in NQGA. The
following description introduces the representation briefly.

In quantum computing, the state of a qubit can be represented as

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle \qquad (6)$$

where $\alpha$ and $\beta$ are probability amplitudes of the corresponding states. Normalization
of the state to unity guarantees

$$|\alpha|^2 + |\beta|^2 = 1 \qquad (7)$$

where $|\alpha|^2$ gives the probability that the qubit will be found in the '0' state and
$|\beta|^2$ gives the probability that the qubit will be found in the '1' state. A system
with $m$ qubits can contain information of $2^m$ states, and any linear superposition of
all possible states can be represented as

$$|\psi\rangle = \sum_{k=1}^{2^m} C_k |S_k\rangle \qquad (8)$$

where $C_k$ specifies the probability amplitude of the corresponding state $S_k$ and is
subject to the normalization condition $|C_1|^2 + |C_2|^2 + \cdots + |C_{2^m}|^2 = 1$. The
probability amplitudes of $m$ qubits are represented as

$$\begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m \\ \beta_1 & \beta_2 & \cdots & \beta_m \end{bmatrix} \qquad (9)$$

where $|\alpha_i|^2 + |\beta_i|^2 = 1$, $i = 1, 2, \ldots, m$.






3.2 Evolutionary Strategy 

Before the evolutionary algorithm of NQGA is described in detail, two definitions and
their interpretation are given first to make the introduced algorithm easier to
understand.

Definition 1: The probability amplitude of one qubit is defined with a pair of real
numbers, $(\alpha, \beta)$, as

$$[\alpha \;\; \beta]^T \qquad (10)$$

where $\alpha$ and $\beta$ satisfy equations (6) and (7).

Definition 2: The phase of a qubit is defined with an angle $\xi$ as

$$\xi = \arctan(\beta / \alpha) \qquad (11)$$

and the product of $\alpha$ and $\beta$ is represented with the symbol $d$, i.e.

$$d = \alpha \cdot \beta \qquad (12)$$

where $d$ stands for the quadrant of the qubit phase $\xi$. If $d$ is positive, the phase $\xi$
lies in the first or third quadrant; otherwise, the phase $\xi$ lies in the second or
fourth quadrant. So the phase of the $i$-th qubit in Eq. (9) is

$$\xi_i = \arctan(\beta_i / \alpha_i) \qquad (13)$$
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The detailed algorithm of NQGA is as follows. 

Step 1: Choose the population size $n$ and the number $m$ of qubits. Generate an
initial population containing $n$ individuals, $P = \{p_1, p_2, \ldots, p_n\}$, where $p_j$
($j = 1, 2, \ldots, n$) is the $j$-th individual of the population and $p_j$ is

$$p_j = \begin{bmatrix} \alpha_{j1} & \alpha_{j2} & \cdots & \alpha_{jm} \\ \beta_{j1} & \beta_{j2} & \cdots & \beta_{jm} \end{bmatrix} \qquad (14)$$

where the values of all $\alpha_{jk}$, $\beta_{jk}$ ($k = 1, 2, \ldots, m$) are $1/\sqrt{2}$, which indicates
that the quantum superposition state is composed of all basic quantum states with the
same probability at the beginning of the search process. The evolutionary generation $g$
is set to 0.

Step 2: According to the probability amplitudes of all individuals in the population,
construct the observation states $R$ of basic quantum states, $R = \{a_1, a_2, \ldots, a_n\}$,
where $a_j$ ($j = 1, 2, \ldots, n$) is the observation state of the $j$-th individual and $a_j$
is a binary string, i.e. $a_j = b_1 b_2 \cdots b_m$, where $b_k$ ($k = 1, 2, \ldots, m$) is a binary
bit, “1” or “0”.

Step 3: All individuals in the observation states $R$ are evaluated using the fitness
function given by equation (5).

Step 4: The best solution of the current generation is maintained. If it is better than
the best solution found so far in the evolutionary process, the latter is replaced by
the former and is maintained. If a satisfactory solution is obtained or the maximum
generation is reached, the algorithm ends; otherwise, the algorithm continues.

Step 5: The quantum rotation gate $G$ is chosen as the quantum logic gate in NQGA,
and $G$ is represented as

$$G = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \qquad (15)$$

where $\theta$ is the rotation angle of $G$ and $\theta = k \cdot h(\alpha, \beta)$. $k$ is a coefficient whose
value affects the speed of convergence and must be chosen reasonably. If $k$ is too big,
the search grid of the algorithm is large and the solutions may diverge or converge
prematurely to a local optimum; if it is too small, the search grid is also small and
the algorithm may stagnate. So $k$ is defined as a variable. In CGA, adjustable
coefficients are often related to the maximal fitness and average fitness of the current
generation, but this strategy cannot be used in NQGA because it would harm the good
characteristic of short computing time. Here, taking advantage of the rapid convergence
of NQGA, $k$ is defined as a variable related to the evolutionary generation, so that the
search grid of NQGA can be adjusted adaptively. For example, $k = 0.5\,e^{-t/\mathrm{max}t}$, where
$t$ is the evolutionary generation and $\mathrm{max}t$ is a constant determined by the complexity
of the optimization problem. The function $h(\alpha, \beta)$ determines the search direction of
convergence to a global optimum. The look-up table below (Table 1) can be used as a
strategy to make the algorithm converge. Thus, the change of the rotation angle of the
quantum rotation gate is determined by comparing the quantum phase of the current
solution with the quantum phase of the best solution, which is called the quantum
phase comparison approach.

Table 1. Look-up table of function h(α, β)

                          h(α, β)
d1 > 0    d2 > 0    |ξ1| > |ξ2|    |ξ1| < |ξ2|
True      True      +1             -1
True      False     -1             +1
False     True      -1             +1
False     False     +1             -1

In Table 1, $d_1 = \alpha_1 \cdot \beta_1$ and $\xi_1 = \arctan(\beta_1 / \alpha_1)$, where $\alpha_1$, $\beta_1$ are the
probability amplitudes of the best solution; $d_2 = \alpha_2 \cdot \beta_2$ and
$\xi_2 = \arctan(\beta_2 / \alpha_2)$, where $\alpha_2$, $\beta_2$ are the probability amplitudes of the current
solution.

After the rotation angles of the quantum gates are computed, the update procedure of
NQGA can be described as

$$p_j^{t+1} = G(t) \cdot p_j^{t} \qquad (16)$$

where $t$ is the evolutionary generation, $G(t)$ is the quantum gate at the $t$-th
generation, and $p_j^{t}$ and $p_j^{t+1}$ are the probability amplitudes of an individual at
generations $t$ and $t+1$ respectively.

Step 6: To introduce good individuals, an immigration operation is carried out once
every $M_g$ generations to help the algorithm find the optimal solution. The immigration
operation allows the algorithm to jump out of a sub-optimal solution. When immigration
is performed, we first generate a new population using the initialization method, except
that each $\alpha$ is a random value from 0 to 1 and the corresponding $\beta$ is
$\pm\sqrt{1 - |\alpha|^2}$. Then the probability amplitudes of several qubits of the best
individual in Step 4 are replaced by the probability amplitudes of the corresponding
qubits of the best individual in the new population.

Step 7: If the best solution maintained has not changed for many generations, say
$C_g$ generations, a population catastrophe operation is executed once, because the
algorithm has stagnated or has been trapped in a local solution. The population
catastrophe operation allows the algorithm to jump out of the local solution and avoid
the stagnant state, so it enhances the search capability rather than degrading the
algorithm. In this operation, the best individual in Step 4 is replaced by the best
individual in a newly generated population.

Step 8: The evolutionary generation $g$ increases by 1 and the algorithm goes to Step 2.
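
Steps 1, 2 and 5 can be sketched in code as follows. This is an illustrative sketch, not
the authors' implementation: all function names are hypothetical, a qubit is stored as a
pair (alpha, beta), the direction function returns 0 when the two phases are equal
(a case the look-up table does not cover), and the phase is taken as pi/2 when alpha is 0.
The fitness evaluation, immigration and catastrophe steps are omitted.

```python
import math
import random

def init_population(n, m):
    """Step 1: n individuals of m qubits, every amplitude equal to 1/sqrt(2)."""
    a = 1.0 / math.sqrt(2.0)
    return [[(a, a) for _ in range(m)] for _ in range(n)]

def observe(individual, rng):
    """Step 2: each qubit yields bit 1 with probability |beta|^2, otherwise 0."""
    return [1 if rng.random() < beta * beta else 0 for _, beta in individual]

def phase(alpha, beta):
    """Phase arctan(beta / alpha); pi/2 is used when alpha is 0 (assumption)."""
    return math.atan(beta / alpha) if alpha != 0 else math.pi / 2

def h(best, current):
    """Direction from the look-up table; 0 when the phases are equal (assumption)."""
    (a1, b1), (a2, b2) = best, current
    xi1, xi2 = abs(phase(a1, b1)), abs(phase(a2, b2))
    if xi1 == xi2:
        return 0
    same_sign = (a1 * b1 > 0) == (a2 * b2 > 0)   # compares d1 > 0 with d2 > 0
    direction = 1 if same_sign else -1
    return direction if xi1 > xi2 else -direction

def rotate(qubit, theta):
    """Apply the rotation gate G(theta) to one qubit."""
    alpha, beta = qubit
    return (math.cos(theta) * alpha - math.sin(theta) * beta,
            math.sin(theta) * alpha + math.cos(theta) * beta)

def update(individual, best, k):
    """Step 5: rotate every qubit by theta = k * h(best qubit, current qubit)."""
    return [rotate(q, k * h(b, q)) for q, b in zip(individual, best)]

# One qubit with phase 0.5 is rotated towards a best qubit with phase 1.0.
best = [(math.cos(1.0), math.sin(1.0))]
current = [(math.cos(0.5), math.sin(0.5))]
moved = update(current, best, k=0.1)
print(phase(*moved[0]))

bits = observe(init_population(1, 8)[0], random.Random(0))
print(bits)
```

Because the rotation gate is orthogonal, each updated qubit still satisfies the
normalization condition, so no renormalization is needed after the update.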



3.3 Performance Test 



We choose two typical functions to test the performance of NQGA.

(1) Multi-peak function:

$$f_1(x) = 10 + \frac{\sin(1/x)}{(x - 0.16)^2 + 0.1}, \quad x \in [0.01, 1] \qquad (17)$$

The function has many local optima in its solution space. The optimal solution is
$f_1(0.1275) = 19.8949$.

(2) Two-dimensional multi-modal function:

$$f_2(x_1, x_2) = \cos(2\pi x_1)\cos(2\pi x_2)\,e^{-(x_1^2 + x_2^2)/10}, \quad -1 \le x_1, x_2 \le 1 \qquad (18)$$

The function is a well-known multi-modal test function and has 13 local optima in its
solution space. The optimal solution is $f_2(0, 0) = 1$.

For comparison, NQGA and GQA are both used to optimize the two functions. The population
size is $n = 10$, the number of qubits is 15, the maximal evolutionary generation is 1000,
and the error is 0.0001. In NQGA, $M_g$ and $C_g$ are 40 and 100 respectively. The
statistical results of 100 tests are shown in Table 2.
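
The two test functions can be evaluated with the short sketch below. It assumes the forms
f1(x) = 10 + sin(1/x) / ((x - 0.16)^2 + 0.1) and
f2(x1, x2) = cos(2πx1) cos(2πx2) exp(-(x1^2 + x2^2)/10); these forms agree with the
optimal values quoted in the text, and the function names are chosen for illustration.

```python
import math

def f1(x):
    """Multi-peak test function on [0.01, 1] (assumed form)."""
    return 10 + math.sin(1 / x) / ((x - 0.16) ** 2 + 0.1)

def f2(x1, x2):
    """Two-dimensional multi-modal test function on [-1, 1]^2 (assumed form)."""
    return (math.cos(2 * math.pi * x1) * math.cos(2 * math.pi * x2)
            * math.exp(-(x1 ** 2 + x2 ** 2) / 10))

print(round(f1(0.1275), 4))
print(f2(0.0, 0.0))

# A coarse grid search confirms that no sampled point exceeds the quoted optimum of f1.
grid_best = max(f1(0.01 + i * 0.0001) for i in range(9901))
print(round(grid_best, 4))
```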



Table 2. The statistical results of 100 tests using NQGA and GQA

Functions   Algorithms   Mean generation   Mean time   Successful rate
f1          GQA          177.17            3.5289      87.00%
f1          NQGA         139.00            3.2617      100.00%
f2          GQA          358.21            13.2192     76.00%
f2          NQGA         249.07            10.7918     100.00%



From Table 2, it can be concluded that the mean generation of NQGA is less than that of
GQA, and the more complex the function is, the more obvious the difference is. The
successful rate of NQGA is much higher than that of GQA, which indicates that the
immigration and catastrophe operations in NQGA are very helpful in making NQGA jump out
of local optima. Although the mean time of NQGA is shorter than that of GQA, the
difference in mean time is less pronounced than that in mean generation, which also
results from the immigration and catastrophe operations of NQGA.



4 Classifier Design and Experimental Result Analysis 

Feature extraction and feature selection can be considered as a nonlinear
transformation that maps the radar emitter signal from a high-dimensional signal
space into a low-dimensional feature space, extracting the most discriminatory
information and removing the redundant information. Although feature extraction and
feature selection are key processes, they are only the first two steps in radar
emitter signal recognition. The recognition task itself is completed by the
classifier, so classifier design is also an important process, subsequent to
feature extraction and feature selection, in radar emitter signal recognition.

Recent research in neural classification has established that neural networks are
a promising alternative to various conventional classification methods. Neural
networks have become an important tool for classification because they have the
following theoretical advantages [11-12]. First, neural networks are data-driven,
self-adaptive methods in that they can adjust themselves to the data without any
explicit specification of a functional or distributional form for the underlying
model. Second, they are universal function approximators in that they can
approximate any function with arbitrary accuracy. Third, neural networks are
nonlinear models, which makes them flexible in modeling complex real-world
relationships. Finally, neural networks are able to estimate posterior
probabilities, which provide the basis for establishing classification rules and
performing statistical analysis. Neural network classifiers are therefore widely
used in signal recognition.

The structure of the neural network classifier is shown in Fig. 1. In Fig. 1, L_1
is the input layer, which has L neurons corresponding to the selected radar
emitter signal features. The hidden layer uses 'tansig' as its transfer function.
The output layer has the same number of neurons as there are radar emitter signals
to be recognized, and its transfer function is 'logsig'. We choose the RPROP
algorithm [13] as the training algorithm of the neural network. The ideal outputs
of the neural network are "1". The output tolerance is 0.05 and the output error
is 0.001.
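
To make the layer structure concrete, the following is a minimal sketch of a forward pass through a network of this shape, written in plain Python. It assumes that 'tansig' and 'logsig' denote the hyperbolic tangent sigmoid and the logistic sigmoid, as in the MATLAB Neural Network Toolbox. The weights are random placeholders, the helper names are hypothetical, the 3-15-10 layer sizes are taken from the experiment described later in this section, and RPROP training is not shown.

```python
import math
import random

def tansig(x: float) -> float:
    """Hyperbolic tangent sigmoid: 2 / (1 + exp(-2x)) - 1, equal to tanh(x)."""
    return math.tanh(x)

def logsig(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in: int, n_out: int, rng: random.Random):
    """Random placeholder weights and biases (hypothetical, untrained)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def layer_forward(inputs, weights, biases, transfer):
    """Apply one fully connected layer followed by its transfer function."""
    return [transfer(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def classify(features, hidden, output):
    """Forward pass: input layer -> tansig hidden layer -> logsig output layer."""
    h = layer_forward(features, *hidden, tansig)
    return layer_forward(h, *output, logsig)

rng = random.Random(0)
hidden = make_layer(3, 15, rng)    # 3 selected features -> 15 hidden neurons
output = make_layer(15, 10, rng)   # 15 hidden neurons -> 10 signal classes
scores = classify([0.2, 0.5, 0.8], hidden, output)
predicted = max(range(len(scores)), key=scores.__getitem__)  # index of the largest output
```

Because the output layer uses a logistic sigmoid, each of the ten outputs lies strictly between 0 and 1, which matches an ideal target output of "1" for the correct class.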

In our prior work [14-16], 16 features were extracted from 10 radar emitter
signals using different approaches. The features are three fractal dimensions
(information dimension, box dimension and correlation dimension), two resemblance
coefficient features, Lempel-Ziv complexity, approximate entropy, wavelet entropy
and 8 energy distribution features based on wavelet packet decomposition. The 10
radar emitter signals are BPSK, QPSK, MPSK, LFM, NLFM, CW, FD, FSK, IPFE and CSF.
The proposed NQGA and GQA are used to search for the optimal feature subset among
the 16 features using the introduced criterion function. In the experiment, the
population size is 10; the number of qubits is 16; the fitness function is given
in Eq. (5); the maximal evolutionary generation is 1000; and the two additional
NQGA control parameters are 40 and 100, respectively. We repeat the experiment 50
times. The changing fitness function values of NQGA and GQA during the search are
shown in Fig. 2 and Fig. 3, respectively, and the statistical results are shown in
Table 3.

The best feature subsets obtained by NQGA and GQA are both composed of 3 features,
and the maximal fitness values obtained by NQGA and GQA are identical. Fig. 2 and
Fig. 3 show that NQGA reaches the maximal fitness value more often than GQA. Table 3
shows that the average fitness value of NQGA is much larger than that of GQA.
Because of the immigration and catastrophe operations in NQGA, the average
generation and average time of NQGA are slightly higher than those of GQA.




Fig. 1. The structure of neural network classifier 




Table 3. Statistical results of GQA and NQGA

Algorithm   Average generation   Average time   Average fitness   Maximal fitness
GQA         465.78               22.59          19.17             28.36
NQGA        486.84               26.08          26.81             28.36

For comparison, the selected feature subset and the original feature set are used
to train NN classifiers to recognize the 10 different radar emitter signals. The
structures of the NN classifiers are 3-15-10 and 16-25-10, respectively. The
recognition results are shown in Table 4, where FSS stands for the selected
feature subset and OFS for the original feature set. All numbers in Table 4 are
percentages. Table 4 shows that the accurate recognition rate using the selected
feature subset is 99.21% and that using the original feature set is 96.21%, which
demonstrates the effectiveness of the proposed approach.




Fig. 3. The fitness values of 50 runs using NQGA 



Table 4. Accurate recognition rates (%) using feature subset selected (FSS) and original feature set (OFS)

Type   BPSK    QPSK    MPSK   LFM     NLFM   CW      FD      FSK     IPFE   CSF
FSS    97.90   78.17   100    90.87   100    98.47   98.78   97.91   100    91.67
OFS    99.01   93.75   100    100     100    100     100     100     100    100



5 Concluding Remarks 

This paper presents a novel feature selection approach for radar emitter signal
recognition, based on a novel quantum genetic algorithm and a good criterion
function. NQGA has the characteristics of good search capability, rapid
convergence and short computing time. Experimental results verify that the given
criterion is an effective evaluation criterion for selecting the optimal feature
subset, and that the selected feature subset not only simplifies the design of
classifiers but also achieves a higher accurate recognition rate than the original
feature set.

Abstract. In this article a model of the biological neuronal regulator system of
the lower urinary tract is presented. The design and implementation of the model
have been carried out using distributed artificial intelligence, more
specifically a system based on agents that carry out tasks of perception,
deliberation and execution. The biological regulator is formed by neuronal
centres. In the model, each agent is modelled so that its behaviour is similar to
that of a neuronal centre. The use of the agent paradigm confers important
properties on the model: adaptability, distributed computing, modularity, and
synchronous or asynchronous functioning. This strategy also allows a complex
systems approach in which the system is formed by connected elements whose
interaction is only partially known. We have simulated and tested the model by
comparing its results with clinical studies.



1 Introduction 

The complexity of operation of the biological structures that control the organic
systems motivates the search for technological support to facilitate their study.
In this association between computer technology and medicine, diagnosis support
systems become particularly relevant. Such systems provide the specialist with
complementary information intended to help in the diagnostic task, which is, on
occasion, extremely complicated. Urology is a medical discipline in which a
diagnosis support system can be useful to both the specialist and the patient.

The model also offers other functional aspects besides help with diagnosis.
Firstly, it allows experimentation with values outside the normal range, which
would not be possible in normal individuals. Secondly, it has a didactic
component, since the system is presented as a simulator of the lower urinary
tract that shows the patient the possible reasons for their urological
dysfunctions, as well as the mechanisms to cure or alleviate them by means of
medication or surgery. In this case (the study of dysfunctions of the lower
urinary tract due to neurological causes), the model can facilitate the study and
understanding of the pathology for both specialist and patient [1].



V. Torra and Y. Narukawa (Eds.): MDAI 2004, LNAI 3131, pp. 104-114, 2004. 

The proposed model is based on static and dynamic properties of the bladder and
the urethra reported in the literature [2, 3, 4, 5]. Most of these papers rely on
simplifications and assumptions. Furthermore, they focus on solving the problem
from a global approach. In this study we deal with the problem from a distributed
viewpoint, with emergent characteristics.

The paradigm of intelligent agents has been used in the design and development of
the system, more concretely PDE agents with the capacity to perceive, to
deliberate and to execute tasks. This paradigm confers on the system great
adaptability and a good approximation to the biological operation of the neuronal
control centres [6]. Furthermore, it is possible to use some characteristics of
the agents to add artificial properties to the model, such as a self-diagnostic
capability.



2 The Biological Model 

The Lower Urinary Tract (LUT) carries out two main functions: storage of urine in
the bladder and expulsion of urine through the urethra (the micturition process).
The LUT can be divided into two parts: the mechanical system and the neuronal
regulator [7, 8, 9]. The first part covers the LUT's biomechanics and the anatomy
and physiology of the muscles and tissues that make it up. The second part refers
to the anatomy and physiology of the neuronal control pathways, retransmission
centres and excitatory and inhibitory areas associated with micturition.

As stated by the International Continence Society [10], the lower urinary tract
comprises the bladder and the urethra. These two elements form a functional unit
with an evident interaction between them [11]. The voluntary nervous system
(identified with the somatic nerves) and the involuntary nervous system (identified
with the sympathetic and parasympathetic systems) participate in the control of the
functions of the lower urinary tract (storage and voiding of the bladder) [12].

The activity of the LUT consists of storing urine for a period of time and then
expelling it, all under voluntary control. The micturition process is a set of
cyclical actions accomplished by the mechanical subsystem and the neuronal
regulator. Basically, the regulation of the LUT is a control of pressures: bladder
pressure and sphincter pressure. During the storage phase the sphincter pressure
is higher than the bladder pressure, and the opposite holds during the micturition
phase. The relation between the pressures can be seen in Fig. 1.

Both tasks (storage and expulsion of urine) are controlled by three nervous
systems: the parasympathetic, the sympathetic and the somatic [9, 12, 13]. Each of
these systems is formed by one or several neuronal groups that take as inputs
afferent signals from the bladder and the urethra and produce as outputs efferent
signals to the bladder and the urethra. A neuronal centre can also receive signals
from other neuronal centres, information that can affect the behaviour of the
receiving centre.

The neuronal centres that participate in the control of the LUT can be classified
into different levels according to their neuroanatomic location [14, 15]:
suprapontine centres, suprasacral centres and sacral centres (see Fig. 1).

Fig. 1. Neuroanatomic distribution of the neuronal centres related to the control of the LUT.
The centres communicate with each other using afferent, efferent and internal signals

The suprapontine centres are located above the pons and comprise the following
[2, 13]: the cortical-diencephalic (CD) centre, related to the voluntary beginning
and stopping of micturition and the involuntary beginning of micturition; the
preoptic area (PA) centre, which is controlled by the CD centre and facilitates
micturition; and the periaqueductal grey area (PAG) centre, which also facilitates
micturition, receives afferent signals from the bladder and urethra, and passes
information on to lower centres.

The pontine centres and the thoracolumbar storage (TS) centre form the
suprasacral centres [3, 13]. We identify two opposing pontine centres: one that
facilitates micturition, the pontine micturition (PM) centre, and another related
to storage, the pontine storage (PS) centre. The TS centre is located in the
spinal segment T11-L2 and is the best-connected centre: it receives afferent
signals from the bladder and internal information from other neuronal centres,
and it sends efferent signals to the mechanical subsystem and internal
information to other neuronal centres. All its connections can be observed in
Fig. 1. This centre relaxes the detrusor (a bladder muscle) and contracts the
sphincter, allowing the storage of urine.

Finally, we can identify three centres in the sacral area [13, 16, 17]: two
associated with micturition, the sacral micturition (SM) centre and the dorsal
grey commissure (DGC) centre, and one associated with storage, the sacral storage
(SS) centre. The SM centre sends efferent signals to the detrusor, the internal
sphincter and the urethral intrinsic muscle (related to the urethra). The SS
centre generates the contraction of the pelvic floor and the urethral extrinsic
muscle, and the DGC centre inhibits the sacral centre that facilitates retention.

In Fig. 1 we can observe that there are internal connections between the
neuronal centres. The interchange of internal information between neuronal centres
is very important for the correct functioning of the neuronal control system: this
information can modify the activity of a centre, even changing its normal
behaviour, if there is a problem in the connection or if the centre that sends the
information operates incorrectly. These internal signals create dependence
relations between the centres, relations that can cause incorrect functioning to
propagate through the neuronal regulator.



3 The Multiagent System 

Traditionally, the modelling of biological systems or the simulation of their
functions has been carried out by means of mathematical models. Techniques of
classic control or techniques allied to artificial intelligence have been used to
model biological regulators, the latter requiring a base of expert knowledge
(heuristics based on fuzzy logic or artificial neural networks). Intelligent
agents are identified with an architecture based on artificial intelligence and
confirmed by the knowledge of an expert [18].

3.1 Intelligent Agents 

The concept of agent as used in this work is based on a structure of perception, 
deliberation and execution [19, 20]. The use of intelligent agents to model the 
neuronal control confers some advantages on the model, such as: 

- Modularity and adaptability. The model can easily be improved as more is learned
about the workings of the neuronal centres that participate in the control of the
urological system. Other neuronal centres can be included immediately, and the
internal control of each centre can be modified: it is only necessary to modify
the system that governs the decision mechanism.

- Distributed computing. Modelling using agents allows a distributed computing
scheme, distributing the computing capacity and the decision-making power among
the different agents that make up the system; the model is closer to the
biological system of neuronal control than a centralized computing scheme with
two components, one mechanical and one neuronal, that interact but do not offer a
distributed idea of control.

- Emergent properties. The interaction between the different agents can give rise
to emergent behaviours which are not defined in the original programmed behaviour
of the individual agents.

- Asynchronous behaviour. The existence of several agents that interact in the
system points directly to the possibility of synchronous or asynchronous
operation. Since the biological system works asynchronously, we have chosen this
mode of operation for the system. This implies that each operation of an agent is
independent in time, its operation being coordinated only by the inherent
behaviour of each neuronal centre and not by an external coordinating mechanism.
As a consequence, the results obtained are characteristic of a system with
emergent properties.




An agent $\alpha$ can be defined by means of the following structure:

$\alpha = (\Phi_\alpha, S_\alpha, Percept_\alpha, Mem_\alpha, Decision_\alpha, Exec_\alpha)$    (1)

where $\Phi_\alpha$ is the set of perceptions, so that:

$\Phi_\alpha = Percept_\alpha(\sigma_i)$    (2)

and $\sigma_i$ is the state of the neuronal regulator; $S_\alpha$ is the set of internal
states of the agent; $Percept_\alpha$ is the function that captures how the environment
of the agent changes; $Decision_\alpha$ is the function that associates an internal
state of the centre with a perception; and, finally, $Exec_\alpha$ carries out the
actions over the system. Fig. 2 shows the structure of a PDE (perception -
deliberation - execution) agent.

In our model each neuronal centre is modelled as an agent which is continually
perceiving information from the environment (afferent and internal signals),
deciding what action to take and executing that action using efferent and internal
signals. Fig. 3 shows the state chart of a neuronal centre modelled as a PDE
agent.
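
The perception-deliberation-execution cycle can be illustrated with a minimal Python sketch. The class name PDEAgent, the signal names and the threshold rule are hypothetical and chosen only for illustration; they are not the decision functions of the actual neuronal centres.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]       # regulator state: signal name -> value
Influence = Dict[str, float]   # proposed values for efferent or internal signals

@dataclass
class PDEAgent:
    """Toy perception-deliberation-execution agent (illustrative only)."""
    name: str
    inputs: List[str]                                 # signals the agent perceives
    decision: Callable[[str, State], str]             # (internal state, perception) -> internal state
    execute: Callable[[str], Influence]               # internal state -> influence
    internal_state: str = "idle"
    memory: List[str] = field(default_factory=list)   # history of internal states

    def percept(self, regulator_state: State) -> State:
        """Perception: keep only the signals this agent is connected to."""
        return {s: regulator_state[s] for s in self.inputs if s in regulator_state}

    def step(self, regulator_state: State) -> Influence:
        """One perceive -> deliberate -> execute cycle."""
        perception = self.percept(regulator_state)
        self.internal_state = self.decision(self.internal_state, perception)
        self.memory.append(self.internal_state)
        return self.execute(self.internal_state)

# Hypothetical storage-like centre: keeps the sphincter contracted while the bladder signal is low.
def decide(state: str, perception: State) -> str:
    return "store" if perception.get("bladder_afferent", 0.0) < 0.8 else "release"

def act(state: str) -> Influence:
    return {"sphincter_efferent": 1.0 if state == "store" else 0.0}

agent = PDEAgent("storage_centre", ["bladder_afferent"], decide, act)
influence = agent.step({"bladder_afferent": 0.3, "urethra_afferent": 0.1})
```

Each agent only sees the signals listed in its inputs, which mirrors the idea that a neuronal centre reacts to its own afferent and internal connections rather than to the global state.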



3.2 A Model of the Biological Neuronal Control 

We propose a model of the neuronal regulator of the LUT based on a multiagent
system in which each agent models the behaviour of a biological neuronal centre.
The model is expressed by the set of agents (NC) that emulate the neuronal centres
and the vision (in terms of actions and reactions) that the regulator has of the
regulated system. This vision is expressed by the following equation:




Fig. 3. State chart of a PDE agent 



$\langle \Sigma, \Gamma, P, React \rangle$    (3)

$\Sigma$ is the set of possible states in which the system can be; $\Gamma$ is the set
of possible intentions of action in the system (an action proposed by an agent is
represented as an intention of modification); P is the set of actions (plans) that
the different agents can execute with the objective of modifying their states;
finally, React() is the function that combines the different influences of the
agents.

The states of the regulator $\sigma_i \in \Sigma$ can be expressed by defining the values
of the different neuronal signals as a list of (signal, value) pairs:

$\sigma_i = ((sig_1, val_1), (sig_2, val_2), \ldots, (sig_{card(C)}, val_{card(C)})) \;/\; sig_j \in C \wedge val_j \in V$    (4)

where C is the set of neuronal signals that participate in the regulation of the LUT
and V is the set of values that these signals can take.

In the same way as we have defined the states of the regulator, it is possible to
define the influences, or attempted actions, as lists of pairs of efferent signals
and their values. We add the empty influence, which behaves as a neutral element
and is used when an agent does not want to modify the system.

The agents of the system operate simultaneously, so several influences may be
generated at the same time. A union function is defined to combine the influences
given by the agents. Using this function and the function React() it is possible
to obtain the new state of the system. Fig. 4 shows a scheme of the neuronal
regulator modelled as a multiagent system.
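
A minimal Python sketch of how simultaneous influences might be combined and applied is given below. The names union and react, the dictionary encoding of states and influences, and the rule that conflicting proposals for the same signal are averaged are all assumptions made for illustration; the text does not specify how conflicts are resolved.

```python
from typing import Dict, Iterable, List

State = Dict[str, float]
Influence = Dict[str, float]

EMPTY_INFLUENCE: Influence = {}   # neutral element: proposes no change

def union(influences: Iterable[Influence]) -> Influence:
    """Combine influences; conflicting values for one signal are averaged (assumption)."""
    proposals: Dict[str, List[float]] = {}
    for influence in influences:
        for signal, value in influence.items():
            proposals.setdefault(signal, []).append(value)
    return {signal: sum(values) / len(values) for signal, values in proposals.items()}

def react(state: State, combined: Influence) -> State:
    """Return the new regulator state after applying the combined influence."""
    new_state = dict(state)
    new_state.update(combined)
    return new_state

state = {"detrusor": 0.0, "sphincter": 1.0}
influences = [{"detrusor": 0.6}, EMPTY_INFLUENCE, {"detrusor": 0.2, "sphincter": 0.0}]
new_state = react(state, union(influences))
```

The empty influence contributes nothing to the union, so an agent that does not want to modify the system leaves the combined result unchanged.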

The model presented can be used to characterize urological dysfunctions related to
the neuronal system that controls the functioning of the LUT. This
characterization can be useful for understanding more accurately the origin of
some abnormal situations, especially those related to continence. To carry out
this characterization it is necessary to identify the incorrect states of the
agents that result from deliberation mistakes (a problem inside the centre), wrong
connections (a problem related to signals) or propagation of mistakes (due to the
internal signals).

The set of internal states of the different agents is useful for identifying
dysfunctions. The signals that define the internal states represent the
bioelectric activity and are normalised. An incorrect state is defined as a
situation not allowed in the normal behaviour of the agent, following the rules
of its biological counterpart.




Fig. 4. Chart of the neuronal regulator. Agents perceive how the world is from the state of the
system (s(t)). With the perception ($\Phi_\alpha(t)$) and the internal state ($S_\alpha(t)$) the agent
changes to another internal state ($S_\alpha(t+1)$) and decides what action to take (p). The
execution of that action generates an influence that tries to act on the system

The distributed structure and the parallel, asynchronous processing of the
agents can give rise to incorrect states that are not related to a dysfunction but
are a consequence of the asynchronous behaviour of the centres. Because of this,
the characterization of a dysfunction in a regulator agent depends on the dynamics
of functioning identified by the evolution of its internal states. A dysfunctional
situation is therefore defined by the persistence of an incorrect internal state
over a period of time. Using this information it is possible to add a new capacity
to the system: a self-diagnostic capability.

Each regulator agent has two roles: one associated with the task of regulating
and another associated with the self-diagnostic. To perform the self-diagnostic it
is necessary to increase the memory capacity of the agents: they store not only
their last internal state but also information about its evolution over time.
This information is used to test the behaviour of the agent and to self-detect an
incorrect state. Fig. 5 shows the state chart of a regulator agent with
self-diagnostic.
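
The persistence criterion can be sketched in a few lines of Python. The class name SelfDiagnosis, the window length and the predicate is_incorrect are hypothetical; the text only states that a dysfunction corresponds to an incorrect internal state that persists over time, not how long it must persist or how it is tested.

```python
from collections import deque
from typing import Callable, Deque

class SelfDiagnosis:
    """Flags a dysfunction when an incorrect state persists for `window` steps (assumed rule)."""

    def __init__(self, is_incorrect: Callable[[str], bool], window: int = 3) -> None:
        self.is_incorrect = is_incorrect
        self.history: Deque[str] = deque(maxlen=window)   # memory of recent internal states

    def observe(self, internal_state: str) -> bool:
        """Record the latest internal state and report whether a dysfunction is detected."""
        self.history.append(internal_state)
        full = len(self.history) == self.history.maxlen
        return full and all(self.is_incorrect(s) for s in self.history)

# Hypothetical example: "blocked" is the only incorrect state of this toy centre.
diagnosis = SelfDiagnosis(lambda s: s == "blocked", window=3)
trace = ["store", "blocked", "store", "blocked", "blocked", "blocked"]
flags = [diagnosis.observe(s) for s in trace]
```

Isolated incorrect states, such as those caused by asynchronous timing, do not trigger the flag; only an uninterrupted run of incorrect states as long as the window does.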



4 Experimentation 

The model has been checked by comparing the results obtained in real situations
with those obtained with a simulation tool based on the model. This tool has been
implemented in C++, executing the agents as different processes that run
simultaneously.

We have checked the centres separately and in groups. In the first example we
check the behaviour of the system in a situation without dysfunctions. The results
in terms of urodynamic measures coincide with real situations. Fig. 6 shows the
bladder pressure and the urine volume in the bladder obtained with the simulator.
The urine volume increases up to 340 ml, after which micturition begins. The
graph is linear because in the experiments we assume a constant input of urine to
the bladder. The bladder pressure increases progressively because of the
relaxation of the detrusor, the closing of the sphincters and the increase in
urine volume. It can be observed that, after a phase of increase, the pressure
reaches an adaptation period as a result of the elastic properties of the
bladder. Voiding begins when the muscles are stimulated and the sphincters open.




Fig. 5. State chart of a PDE agent with self-diagnostic 



In the second example we show the behaviour of the system in the presence of a
dysfunction. In particular, the dysfunction consists of an alteration in the
functioning of the SS centre: this centre does not begin the last phase of
storage and continues with a low stimulation of the muscles. The result is a
situation of incontinence. Fig. 7 shows that there is a moment from which the
pressure and the urine volume are constant. This is because there is a leak of
urine that keeps the internal pressures balanced.

[Figure: two panels, bladder pressure and bladder volume, each plotted against time (sec)]

Fig. 6. Pressure and volume in the bladder in a normal situation



[Figure: two panels, bladder pressure and bladder volume]

Fig. 7. Pressure and volume in the bladder when there is a problem in the SS centre



Finally, we have used the model to study the relevance of the internal
connections. To do this we have analysed around 200 artificial dysfunctions of the
agents and how these dysfunctions can be propagated to other agents by means of
the internal signals. Fig. 8 presents the results of the self-diagnostic and the
analysis of the propagation of the dysfunctions. It can be observed that the
diagnosis in the TS centre has 100% reliability (a correct diagnosis), while the
centre with the lowest reliability is the PA centre. On the other hand, the centre
most likely to suffer a dysfunction whose origin is another centre (a propagated
dysfunction) is the PM centre. Another important conclusion drawn from this
analysis is that the pontine centres behave as retransmitters, which coincides
with the medical knowledge about these centres.



5 Conclusion 

In this work we present an alternative approach to the simulation of biological
processes, designing the regulator system of the lower urinary tract using a model
based on intelligent agents. We have observed how the use of this methodology
facilitates the modelling of an essentially distributed biological system such as
the neuronal regulator of the lower urinary tract. Furthermore, we have added
artificial services to the model, such as the self-diagnostic in the agents.

The model has been implemented and a simulation system has been created to
verify the operation of the lower urinary tract when some of its components
operate with values outside the normal ranges of activity. To simulate
dysfunctions in the system we act only on the agent corresponding to the neuronal
centre that is affected by the dysfunction.




Fig. 8. Results of reliability of self-diagnostic by centre 



To check the validity of the model we have carried out several tests and have
observed that the results obtained with the model coincide with the expected
behaviour, according to the data found in the specialized literature and the
information given by urological specialists.

As a continuation of this work we intend to look more deeply into the diagnostic
task in order to implement a decision support system to assist physicians. We
would like to improve the deliberative capacity of the intelligent agents and to
give the agents a richer capacity for communication.

Future work includes the design of a hardware platform whose architecture
embodies the intrinsic characteristics of multi-agent systems. This architecture
is aimed at the modelling of neuronal system regulators.

Abstract. Experts' reasoning, which selects the final diagnosis from many
candidates, consists of hierarchical differential diagnosis. In other words, the
candidates form a sophisticated hierarchical taxonomy, usually described as a
tree. In this paper, the characteristics of experts' rules are closely examined
from the viewpoint of hierarchical decision steps, and a new approach to rule
mining with extraction of diagnostic taxonomy from medical datasets is
introduced. The key elements of this approach are the calculation of the
characterization set of each decision attribute (a given class) and of the
similarities between characterization sets. From the relations between
similarities, a tree-based taxonomy is obtained, which includes enough
information for diagnostic rules.



1 Introduction 

Rule mining has been applied to many domains. However, empirical results show
that interpretation of extracted rules requires deep understanding of the applied
domains. One of the reasons is that conventional rule induction methods such as
C4.5 [6] cannot reflect the type of experts' reasoning. For example, rule
induction methods such as PRIMEROSE [9] induce the following common rule for
muscle contraction headache from databases on differential diagnosis of headache:

[location = whole] ∧ [Jolt Headache = no] ∧ [Tenderness of M1 = yes]
→ muscle contraction headache.

This rule is shorter than the following rule given by medical experts.

[Jolt Headache = no]
∧ ([Tenderness of M0 = yes] ∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes])
∧ [Tenderness of B1 = no] ∧ [Tenderness of B2 = no] ∧ [Tenderness of B3 = no]
∧ [Tenderness of C1 = no] ∧ [Tenderness of C2 = no] ∧ [Tenderness of C3 = no]
∧ [Tenderness of C4 = no]
→ muscle contraction headache

where [Tenderness of B1 = no] and [Tenderness of C1 = no] are added.



V. Torra and Y. Narukawa (Eds.): MDAI 2004, LNAI 3131, pp. 115-126, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 

One of the main reasons why rules are short is that these patterns are generated
by only a single criterion, such as high accuracy or high information gain. The
comparative studies [9, 11] suggest that experts acquire rules not only by a
single criterion but by the use of several measures. For example, the
classification rule for muscle contraction headache given above is very similar
to the following classification rule for disease of cervical spine:

[Jolt Headache = no]
∧ ([Tenderness of M0 = yes] ∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes])
∧ ([Tenderness of B1 = yes] ∨ [Tenderness of B2 = yes] ∨ [Tenderness of B3 = yes]
∨ [Tenderness of C1 = yes] ∨ [Tenderness of C2 = yes] ∨ [Tenderness of C3 = yes]
∨ [Tenderness of C4 = yes])
→ disease of cervical spine

The differences between these two rules lie in the attribute-value pairs from
tenderness of B1 to C4. Thus, these two rules are composed of the following three
blocks:

A_1 ∧ A_2 ∧ A_3 → muscle contraction headache
A_1 ∧ A_2 ∧ ¬A_3 → disease of cervical spine,

where A_1, A_2 and A_3 are given as the following formulae:

A_1 = [Jolt Headache = no], A_2 = [Tenderness of M0 = yes] ∨ [Tenderness of
M1 = yes] ∨ [Tenderness of M2 = yes], and A_3 = [Tenderness of C1 = no] ∧
[Tenderness of C2 = no] ∧ [Tenderness of C3 = no] ∧ [Tenderness of C4 = no].
The first two blocks (A_1 and A_2) and the third one (A_3) represent different
types of differential diagnosis. The first one, A_1, discriminates between the
muscular type and the vascular type of headache. The second one, A_2,
discriminates between headaches caused by neck muscles and by head muscles.
Finally, the third formula, A_3, is used to make a differential diagnosis between
muscle contraction headache and disease of cervical spine. Thus, medical experts
first select several diagnostic candidates, which are very similar to each other,
from many diseases and then make a final diagnosis from those candidates.
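
The block structure of the two expert rules can be made explicit with a small Python sketch. The patient records are invented for illustration, the function names are hypothetical, and the sketch assumes that muscle contraction headache corresponds to A_1 ∧ A_2 ∧ A_3 and disease of cervical spine to A_1 ∧ A_2 ∧ ¬A_3, with A_3 defined over C1 to C4 as stated in the text.

```python
from typing import Dict, Optional

Patient = Dict[str, str]   # attribute name -> "yes" / "no"

def a1(p: Patient) -> bool:
    """A_1: no jolt headache (muscular rather than vascular type)."""
    return p["Jolt Headache"] == "no"

def a2(p: Patient) -> bool:
    """A_2: tenderness of at least one of M0, M1, M2."""
    return any(p[f"Tenderness of M{i}"] == "yes" for i in range(3))

def a3(p: Patient) -> bool:
    """A_3: no tenderness of C1 to C4."""
    return all(p[f"Tenderness of C{i}"] == "no" for i in range(1, 5))

def diagnose(p: Patient) -> Optional[str]:
    """Shared blocks A_1 and A_2 select the candidates; A_3 separates them."""
    if not (a1(p) and a2(p)):
        return None   # outside the candidate group covered by these two rules
    return "muscle contraction headache" if a3(p) else "disease of cervical spine"

def make_patient(m1: str, c2: str) -> Patient:
    """Build a hypothetical record with all other tenderness attributes set to 'no'."""
    p = {"Jolt Headache": "no"}
    p.update({f"Tenderness of M{i}": "no" for i in range(3)})
    p.update({f"Tenderness of C{i}": "no" for i in range(1, 5)})
    p["Tenderness of M1"] = m1
    p["Tenderness of C2"] = c2
    return p

print(diagnose(make_patient("yes", "no")))    # A_1, A_2 and A_3 all hold
print(diagnose(make_patient("yes", "yes")))   # A_1 and A_2 hold, A_3 fails
```

The two-stage structure of diagnose reflects the hierarchical reasoning described above: a group of similar candidates is selected first, and the final diagnosis is made within that group.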

In this paper, the characteristics of experts' rules are closely examined from
the viewpoint of hierarchical decision steps. Then, extraction of diagnostic
taxonomy from medical datasets is introduced, which consists of the following
three procedures. First, the characterization set of each decision attribute (a
given class) is extracted from databases. Then, similarities between
characterization sets are calculated. Finally, the concept hierarchy for given
classes is calculated from the similarity values.

2 Rough Set Theory: Preliminaries 

In the following sections, we use the notation introduced by Grzymala-Busse and Skowron [8], which is based on rough set theory [5].

Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., $a : U \rightarrow V_a$ for $a \in A$, where $V_a$ is called



the domain of a. Then, a decision table is defined as an information system, $\mathcal{A} = (U, A \cup \{d\})$, where d is a decision attribute.

The atomic formulae over $B \subseteq A \cup \{d\}$ and V are expressions of the form $[a = v]$, called descriptors over B, where $a \in B$ and $v \in V_a$. The set $F(B, V)$ of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For example, [location = occular] is a descriptor of B. For each $f \in F(B, V)$, $f_A$ denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows: (1) if f is of the form $[a = v]$, then $f_A = \{s \in U \mid a(s) = v\}$; (2) $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$.

Within the framework above, classification accuracy and coverage (true positive rate) are defined as follows.

Definition 1.

Let R and D denote a formula in $F(B, V)$ and a set of objects which belong to a decision d. Classification accuracy and coverage (true positive rate) for $R \rightarrow d$ are defined as:

$\alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R))$, and $\kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D))$,

where $|S|$, $R_A$, $\alpha_R(D)$, $\kappa_R(D)$ and $P(S)$ denote the cardinality of a set S, the meaning of R (i.e., the set of examples which satisfy R), the classification accuracy of R as to classification of D, the coverage (the true positive rate of R to D), and the probability of S, respectively.

It is notable that $\alpha_R(D)$ measures the degree of the sufficiency of a proposition, $R \rightarrow D$, and that $\kappa_R(D)$ measures the degree of its necessity.
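
To make Definition 1 concrete, the following Python sketch computes both measures for a single descriptor on a toy decision table. The table rows, the `class` column name and the helper names `meaning` and `accuracy_and_coverage` are assumptions introduced for illustration; they are not taken from the paper.

```python
from fractions import Fraction

# Toy decision table (illustrative rows only); "class" plays the role of d.
TABLE = [
    {"nat": "per", "jolt": 0, "class": "m.c.h."},
    {"nat": "per", "jolt": 0, "class": "m.c.h."},
    {"nat": "per", "jolt": 1, "class": "psycho"},
    {"nat": "thr", "jolt": 1, "class": "common"},
]


def meaning(table, formula):
    """Return R_A: the indices of the objects that satisfy the formula."""
    return {i for i, row in enumerate(table) if formula(row)}


def accuracy_and_coverage(table, formula, decision):
    """Return (alpha_R(D), kappa_R(D)) as exact fractions."""
    r_a = meaning(table, formula)
    d = {i for i, row in enumerate(table) if row["class"] == decision}
    both = len(r_a & d)
    alpha = Fraction(both, len(r_a)) if r_a else Fraction(0)
    kappa = Fraction(both, len(d)) if d else Fraction(0)
    return alpha, kappa


# R = [nat = per], D = m.c.h.
alpha, kappa = accuracy_and_coverage(TABLE, lambda row: row["nat"] == "per", "m.c.h.")
print(alpha, kappa)  # 2/3 1
```

Here the descriptor [nat = per] covers every m.c.h. case (coverage 1) but also one psycho case, so it is a necessary condition for m.c.h. without being a sufficient one.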

Also, we define a partial order of equivalence as follows:

Definition 2. Let $R_i$ and $R_j$ be formulae in $F(B, V)$ and let $A(R_i)$ denote the set whose elements are the attribute-value pairs of the form [a, v] included in $R_i$. If $A(R_i) \subseteq A(R_j)$, then we represent this relation as $R_i \preceq R_j$.

Finally, according to the above definitions, probabilistic rules with high accuracy and coverage are defined as:

$R \xrightarrow{\alpha, \kappa} d$ s.t. $R = \vee_i R_i = \vee \wedge_j [a_j = v_k]$, $\alpha_{R_i}(D) \geq \delta_\alpha$ and $\kappa_{R_i}(D) \geq \delta_\kappa$,

where $\delta_\alpha$ and $\delta_\kappa$ denote given thresholds for accuracy and coverage, respectively.



3 Characterization Sets 

3.1 Characterization Sets 

In order to model medical reasoning, the statistical measure of coverage plays an important role. Let us define a characterization set of D, denoted by $L(D)$, as a set each element of which is an elementary attribute-value pair R whose coverage is equal to or larger than a given threshold, $\delta_\kappa$. That is,

Definition 3. Let R denote a formula in $F(B, V)$. The characterization set of a target concept D is defined as: $L_{\delta_\kappa}(D) = \{R \mid \kappa_R(D) \geq \delta_\kappa\}$.

Then, three types of relations between characterization sets can be defined as follows: (1) Independent type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) = \emptyset$, (2) Overlapped type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) \neq \emptyset$, and (3) Subcategory type: $L_{\delta_\kappa}(D_i) \subseteq L_{\delta_\kappa}(D_j)$. These three types correspond to the negative region, boundary region, and positive region, respectively, if the set of all elementary attribute-value pairs is taken as the universe of discourse.
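
The three relation types can be checked mechanically once characterization sets are stored as plain sets. The sketch below is one possible reading, not the paper's implementation: the function name `relation_type`, the tuple encoding of attribute-value pairs and the sample sets are assumptions. Because the overlapped type as defined also contains the subcategory type, the sketch tests for inclusion first and reports "overlapped" only for proper overlaps.

```python
def relation_type(l_i, l_j):
    """Classify the relation between two characterization sets L(D_i) and L(D_j)."""
    if not (l_i & l_j):
        return "independent"   # empty intersection
    if l_i <= l_j:
        return "subcategory"   # L(D_i) is included in L(D_j)
    return "overlapped"        # non-empty intersection without inclusion


# Hypothetical characterization sets, encoded as (attribute, value) pairs.
L_A = {("nat", "per"), ("prod", 0)}
L_B = {("nat", "per"), ("prod", 0), ("loc", "occular")}
L_C = {("nat", "thr")}

print(relation_type(L_A, L_B))  # subcategory
print(relation_type(L_B, L_A))  # overlapped
print(relation_type(L_A, L_C))  # independent
```

Note that the subcategory test is directional, so swapping the arguments can change the answer.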

Tsumoto focuses on the subcategory type in [10] because $D_i$ and $D_j$ cannot be differentiated by using the characterization set of $D_j$, which suggests that $D_i$ is a generalized disease of $D_j$. Then, Tsumoto generalizes the above rule induction method into the overlapped type, considering rough inclusion [11]. However, both studies assume two-level diagnostic steps: a focusing mechanism and differential diagnosis, where the former selects diagnostic candidates from all the classes and the latter makes a differential diagnosis between the focused classes.

The proposed method below extends these methods into multi-level steps. In this paper, we consider the special case of characterization sets in which the threshold of coverage is equal to 1.0: $L_{1.0}(D) = \{R_i \mid \kappa_{R_i}(D) = 1.0\}$. It is notable that this set has several interesting characteristics.

Theorem 1. Let $R_i$ and $R_j$ be two formulae in $L_{1.0}(D)$ such that $R_i \preceq R_j$. Then, $\alpha_{R_i} \leq \alpha_{R_j}$.

Theorem 2. Let R be a formula in $L_{1.0}(D)$ such that $R = \vee_j [a_i = v_j]$. Then, R and $\neg R$ give the coarsest partition for $a_i$, whose R includes D.

Theorem 3. Let A consist of $\{a_1, a_2, \ldots, a_n\}$ and $R_i$ be a formula in $L_{1.0}(D)$ such that $R_i = \vee_j [a_i = v_j]$. Then, a sequence of conjunctive formulae $F(k) = \wedge_{i=1}^{k} R_i$ gives a sequence which increases the accuracy.

4 Rule Induction with Diagnostic Taxonomy 

4.1 Intuitive Ideas 

As discussed in Section 2, when the coverage of R for a target concept D is equal to 1.0, R is a necessary condition of D. That is, the proposition $D \rightarrow R$ holds and its contrapositive $\neg R \rightarrow \neg D$ holds. Thus, if R is not observed, D cannot be a candidate of a target concept. Thus, if two target concepts have a common formula R whose coverage is equal to 1.0, then $\neg R$ supports the negation of the two concepts, which means these two concepts belong to the same group. Furthermore, if two target concepts have similar formulae $R_i, R_j \in L_{1.0}(D)$, they are very close to each other with respect to the negation of the two concepts. In this case, the attribute-value pairs in the intersection of $L_{1.0}(D_i)$ and $L_{1.0}(D_j)$ give a characterization set of the concept that unifies $D_i$ and $D_j$, $D_k$. Then, by comparing $D_k$ with the other target concepts, classification rules for $D_k$ can be obtained.
procedure Grouping;
var inputs
    Lc : List;      /* A list of Characterization Sets */
    Lid : List;     /* A list of Intersection */
    Ls : List;      /* A list of Similarity */
var outputs
    Lgr : List;     /* A list of Grouping */
var
    k : integer; Lg, Lgr : List;
begin
    Lg := {};
    k := n;         /* n: A number of Target Concepts */
    Sort Ls with respect to similarities;
    Take a set of (Di, Dj), Lmax, with maximum similarity values;
    k := k + 1;
    forall (Di, Dj) ∈ Lmax do
    begin
        Group Di and Dj into Dk;
        Lc := Lc − {(Di, L1.0(Di))};
        Lc := Lc − {(Dj, L1.0(Dj))};
        Lc := Lc + {(Dk, L1.0(Dk))};
        Update Lid for Dk;
        Update Ls;
        Lgr := (Grouping for Lc, Lid, and Ls);
        Lg := Lg + {{(Dk, Di, Dj), Lgr}};
    end
    return Lg;
end {Grouping}

Fig. 1. An Algorithm for Grouping


procedure Rule Induction;
var inputs
    Lc : List;      /* A list of Characterization Sets */
    Lid : List;     /* A list of Intersection */
    Lg : List;      /* A list of grouping */
                    /* {{(Dn+1, Di, Dj), {(Dn+2, ...) ...}}} */
                    /* n: A number of Target Concepts */
var
    Q, Lr : List;
begin
    Q := Lg; Lr := {};
    if (Q ≠ ∅) then do
    begin
        Q := Q − first(Q);
        Lr := Rule Induction(Lc, Lid, Q);
    end
    (Dk, Di, Dj) := first(Q);
    if (Di ∈ Lc and Dj ∈ Lc) then do
    begin
        Induce a Rule r which discriminates between Di and Dj;
        r = {Ri → Di, Rj → Dj};
    end
    else do
    begin
        Search for L1.0(Di) from Lc;
        Search for L1.0(Dj) from Lc;
        if (i < j) then do
        begin
            r(Di) := ∨ {¬Rl : Rl ∈ L1.0(Dj)} → ¬Dj;
            r(Dj) := ∧ {Rl : Rl ∈ L1.0(Dj)} → Dj;
        end
        r := {r(Di), r(Dj)};
    end
    return Lr := {r, Lr};
end {Rule Induction}

Fig. 2. An Algorithm for Rule Induction



When we have a sequence of groupings, classification rules for a given target concept are defined as a sequence of subrules. From these ideas, a rule induction algorithm with grouping of target concepts can be described as a combination of grouping (Figure 1) and rule induction (Figure 2).



4.2 Similarity 



Single-Valued Similarity To measure the similarity between two characterization sets, we can apply several indices of two-way contingency tables. Table 1 gives a contingency table for two rules, $L_{1.0}(D_i)$ and $L_{1.0}(D_j)$. The first cell a (the intersection of the first row and column) shows the number of matched attribute-value pairs. From this table, several kinds of similarity measures can be defined [2, 3]. It is notable that these indices satisfy the property of symmetry.

In this paper, we focus on two similarity measures: one is Simpson's measure,

$\frac{a}{\min\{(a+b), (a+c)\}}$,

and the other is Braun's measure,

$\frac{a}{\max\{(a+b), (a+c)\}}$.
Table 1. Contingency Table for Similarity

                                     L1.0(Dj)
                          Observed   Not observed   Total
L1.0(Di)   Observed       a          b              a + b
           Not observed   c          d              c + d
           Total          a + c      b + d          a + b + c + d



Table 2. A List of Similarity Measures

(1) Matching Number                  $a$
(2) Jaccard's coefficient            $a / (a + b + c)$
(3) $\chi^2$-statistic               $N(ad - bc)^2 / M$
(4) point correlation coefficient    $(ad - bc) / \sqrt{M}$
(5) Kulczynski                       $\frac{1}{2}\left(\frac{a}{a+b} + \frac{a}{a+c}\right)$
(6) Ochiai                           $a / \sqrt{(a+b)(a+c)}$
(7) Simpson                          $a / \min\{(a+b), (a+c)\}$
(8) Braun                            $a / \max\{(a+b), (a+c)\}$

$N = a + b + c + d$, $M = (a + b)(b + c)(c + d)(d + a)$
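
As a quick numerical check of Table 2, the sketch below evaluates the measures that depend only on the cells a, b and c. The function name `similarity_measures` is an assumption, and measures (3) and (4) are left out because they also require d and M. The sample counts correspond to two characterization sets of eight pairs each that share six pairs.

```python
from math import sqrt


def similarity_measures(a, b, c):
    """Measures (1), (2), (5)-(8) of Table 2; requires a + b > 0 and a + c > 0."""
    row, col = a + b, a + c
    return {
        "matching number": a,
        "Jaccard": a / (a + b + c),
        "Kulczynski": 0.5 * (a / row + a / col),
        "Ochiai": a / sqrt(row * col),
        "Simpson": a / min(row, col),
        "Braun": a / max(row, col),
    }


m = similarity_measures(a=6, b=2, c=2)
for name, value in m.items():
    print(f"{name}: {value}")
```

When the two sets have the same size, as here, Simpson's and Braun's measures coincide; they drift apart as the sizes become unbalanced.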



As discussed in Section 4, a single-valued similarity becomes low when $L_{1.0}(D_i) \subset L_{1.0}(D_j)$ and $|L_{1.0}(D_i)| \ll |L_{1.0}(D_j)|$. For example, let us consider the case when $|L_{1.0}(D_i)| = 1$. Then, the matching number is equal to 1.0, which is the lowest value of this similarity. In the case of Jaccard's coefficient, the value is $1/(1 + b)$ or $1/(1 + c)$: the similarity is very small when $1 \ll b$ or $1 \ll c$. Thus, these similarities do not reflect the subcategory type. Thus, we should check the difference between $a + b$ and $a + c$ to consider the subcategory type. One solution is to take an interval of maximum and minimum as a similarity, which we call an interval-valued similarity.

For this purpose, we combine Simpson and Braun similarities and define an interval-valued similarity:

$\left[ \frac{a}{\max\{(a+b), (a+c)\}}, \frac{a}{\min\{(a+b), (a+c)\}} \right]$.

If the difference between the two values is large, it would be better not to consider this similarity for grouping in the lower generalization level. For example, when $a + c = 1$ ($a = 1$, $c = 0$), the above value will be $\left[ \frac{1}{1+b}, 1 \right]$. If $b \gg 1$, then this similarity should be kept as the final candidate for the grouping.

The disadvantage is that it is difficult to compare these interval values. In this paper, the maximum value of a given interval is taken as the representative of this similarity when the difference between the minimum and the maximum is not so large. If the maximum values of two intervals are equal, then their minimum values are compared, and the interval with the larger minimum value is selected.
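
The comparison rule just described can be sketched as a sort key. The code below is a minimal illustration under stated assumptions: the names `interval_similarity` and `ranking_key` are hypothetical, exact fractions are used to avoid rounding issues on ties, and the additional caveat about intervals whose two ends differ widely is not implemented. The three candidates are the intervals [2/8, 2/4], [3/7, 3/6] and [2/7, 2/4] that are compared in the example of Section 5.

```python
from fractions import Fraction


def interval_similarity(a, b, c):
    """Return (Braun, Simpson) = (a / max(a+b, a+c), a / min(a+b, a+c))."""
    low, high = min(a + b, a + c), max(a + b, a + c)
    if low == 0:  # one set is empty: treat the similarity as [0, 0]
        return Fraction(0), Fraction(0)
    return Fraction(a, high), Fraction(a, low)


def ranking_key(interval):
    """Order intervals by their maximum first and by their minimum on ties."""
    minimum, maximum = interval
    return (maximum, minimum)


candidates = {
    "pair 1": interval_similarity(2, 6, 2),  # [2/8, 2/4]
    "pair 2": interval_similarity(3, 4, 3),  # [3/7, 3/6]
    "pair 3": interval_similarity(2, 5, 2),  # [2/7, 2/4]
}
best = max(candidates, key=lambda name: ranking_key(candidates[name]))
print(best)  # pair 2
```

All three maxima equal 1/2, so the decision falls to the minima, and 3/7 is the largest of them.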
Table 3. A small example of a database

No.  loc      nat  his       prod  jolt  nau  M1  M2  class
1    occular  per  per       0     0     0    1   1   m.c.h.
2    whole    per  per       0     0     0    1   1   m.c.h.
3    lateral  thr  par       0     1     1    0   0   common.
4    lateral  thr  par       1     1     1    0   0   classic.
5    occular  per  per       0     0     0    1   1   psycho.
6    occular  per  subacute  0     1     1    0   0   i.m.l.
7    occular  per  acute     0     1     1    0   0   psycho.
8    whole    per  chronic   0     0     0    0   0   i.m.l.
9    lateral  thr  per       0     1     1    0   0   common.
10   whole    per  per       0     0     0    1   1   m.c.h.

Definitions. loc: location, nat: nature, his: history, prod: prodrome, nau: nausea, jolt: Jolt headache, M1, M2: tenderness of M1 and M2, 1: Yes, 0: No, per: persistent, thr: throbbing, par: paroxysmal, m.c.h.: muscle contraction headache, psycho.: psychogenic pain, i.m.l.: intracranial mass lesion, common.: common migraine, and classic.: classical migraine.



$L_{1.0}$(m.c.h.) = {([loc = occular] ∨ [loc = whole]), [nat = per], [his = per],
    [prod = 0], [jolt = 0], [nau = 0], [M1 = 1], [M2 = 1]}

$L_{1.0}$(common) = {[loc = lateral], [nat = thr], ([his = per] ∨ [his = par]), [prod = 0],
    [jolt = 1], [nau = 1], [M1 = 0], [M2 = 0]}

$L_{1.0}$(classic) = {[loc = lateral], [nat = thr], [his = par], [prod = 1],
    [jolt = 1], [nau = 1], [M1 = 0], [M2 = 0]}

$L_{1.0}$(i.m.l.) = {([loc = occular] ∨ [loc = whole]), [nat = per],
    ([his = subacute] ∨ [his = chronic]), [prod = 0],
    [jolt = 1], [M1 = 0], [M2 = 0]}

$L_{1.0}$(psycho) = {[loc = occular], [nat = per], ([his = per] ∨ [his = acute]),
    [prod = 0]}

Fig. 3. Characterization Sets for Table 3



5 Example 

Let us consider the case of Table 3 as an example for rule induction. For a similarity function, we use a matching number [3], which is defined as the cardinality of the intersection of the two sets. Also, since Table 3 has five classes, k is set to 6. For extraction of taxonomy, the interval-valued similarity is applied.

5.1 Grouping

From this table, the characterization set for each concept is obtained as shown in Fig. 3. Then, the intersection between two target concepts is calculated. In the first level, the similarity matrix is generated as shown in Fig. 4.

Since common and classic have the maximum matching number, these two classes are grouped into one category, $D_6$. Then, the characterization of $D_6$ is obtained as: $D_6$ = {[loc = lateral], [nat = thr], [jolt = 1], [nau = 1], [M1 = 0], [M2 = 0]}. In the second iteration, the intersection of $D_6$ and the others is considered and the similarity matrix is obtained as shown in Fig. 5. From this matrix, we have to compare three candidates: [2/8, 2/4], [3/7, 3/6] and [2/7, 2/4]. From the minimum values, the middle one, $D_6$ and i.m.l., is selected as the second grouping. Thus, $D_7$ = {[jolt = 1], [M1 = 0], [M2 = 0]}. In the third iteration, the intersection matrix is calculated as in Fig. 6, and m.c.h. and psycho are grouped into $D_8$: $D_8$ = {[nat = per], [prod = 0]}. Finally, the dendrogram is given as Fig. 7.

          m.c.h.  common      classic     i.m.l.      psycho
m.c.h.    -       [1/8, 1/8]  [0, 0]      [3/8, 3/7]  [2/8, 2/4]
common    -       -           [6/8, 6/8]  [4/8, 4/7]  [1/7, 1/4]
classic   -       -           -           [3/8, 3/7]  0
i.m.l.    -       -           -           -           [2/7, 2/4]

Fig. 4. Interval-valued Similarity of Two Characterization Sets (Step 2)

          m.c.h.  D6      i.m.l.      psycho
m.c.h.    -       [0, 0]  [3/8, 3/7]  [2/8, 2/4]
D6        -       -       [3/7, 3/6]  0
i.m.l.    -       -       -           [2/7, 2/4]

Fig. 5. Interval-valued Similarity of Two Characterization Sets after the first Grouping (Step 3)

          m.c.h.  D7      psycho
m.c.h.    -       [0, 0]  [2/8, 2/4]
D7        -       [0, 0]  [0, 0]

Fig. 6. Interval-valued Similarity of Two Characterization Sets after the second Grouping (Step 4)

[Dendrogram over common, classic, i.m.l., m.c.h. and psycho: common and classic are merged first ($D_6$), i.m.l. then joins them ($D_7$), and m.c.h. and psycho are merged into $D_8$.]

Fig. 7. Grouping by Characterization Sets
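
The grouping iterations of this example can be replayed with a few lines of Python. The following is a simplified sketch rather than the procedure of Fig. 1: it starts from the characterization sets of Fig. 3, encodes each formula as an opaque string (so two formulae match only when they are written identically), merges one pair per iteration, and ranks pairs by the maximum of the interval with the minimum as tie-breaker. The names `SETS`, `interval_similarity` and `group` are assumptions, and the caveat about intervals with a large gap between minimum and maximum is not modelled.

```python
from fractions import Fraction
from itertools import combinations

# Characterization sets of Fig. 3, one string per formula.
SETS = {
    "m.c.h.": {"loc=occular|whole", "nat=per", "his=per", "prod=0",
               "jolt=0", "nau=0", "M1=1", "M2=1"},
    "common": {"loc=lateral", "nat=thr", "his=per|par", "prod=0",
               "jolt=1", "nau=1", "M1=0", "M2=0"},
    "classic": {"loc=lateral", "nat=thr", "his=par", "prod=1",
                "jolt=1", "nau=1", "M1=0", "M2=0"},
    "i.m.l.": {"loc=occular|whole", "nat=per", "his=subacute|chronic",
               "prod=0", "jolt=1", "M1=0", "M2=0"},
    "psycho": {"loc=occular", "nat=per", "his=per|acute", "prod=0"},
}


def interval_similarity(x, y):
    """Return (a / max(|x|, |y|), a / min(|x|, |y|)) with a = |x & y|."""
    a = len(x & y)
    if a == 0:
        return Fraction(0), Fraction(0)
    return Fraction(a, max(len(x), len(y))), Fraction(a, min(len(x), len(y)))


def group(sets, first_index):
    """Merge the most similar pair until one group is left.

    Returns a list of (new name, left, right, intersection) tuples.
    """
    sets = dict(sets)
    merges, k = [], first_index
    while len(sets) > 1:
        def key(pair):
            low, high = interval_similarity(sets[pair[0]], sets[pair[1]])
            return (high, low)  # maximum first, minimum as tie-breaker
        left, right = max(combinations(sets, 2), key=key)
        name = f"D{k}"
        merged = sets.pop(left) & sets.pop(right)
        sets[name] = merged
        merges.append((name, left, right, merged))
        k += 1
    return merges


for name, left, right, merged in group(SETS, first_index=6):
    print(name, "=", left, "+", right, sorted(merged))
```

Under these assumptions the first three merges are D6 (common, classic), D7 (i.m.l., D6) and D8 (m.c.h., psycho), with the same intersections as in the text; the last merge joins D7 and D8 with an empty intersection and corresponds to the root of the dendrogram.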

5.2 Rule Induction 

The grouping obtained shows the candidates of the differential diagnosis for matching number and interval-valued similarity. For differential diagnosis, this model first discriminates between $D_7$ (common, classic and i.m.l.) and $D_8$ (m.c.h. and psycho). Then, $D_6$ and i.m.l. within $D_7$ are differentiated. Finally, common and classic within $D_6$ are checked. Thus, a classification rule for common is composed of three subrules: (discrimination between $D_7$ and $D_8$), (discrimination between $D_6$ and i.m.l.), and (discrimination within $D_6$).

The first part can be obtained by the intersection for Figure 6. That is,

$D_8$ → [nat = per] ∧ [prod = 0]

¬[nat = per] ∨ ¬[prod = 0] → ¬$D_8$.

Then, the second part can be obtained by the intersection for Figure 5. That is,

¬([loc = occular] ∨ [loc = whole]) ∨ ¬[nat = per]
∨ ¬([his = subacute] ∨ [his = chronic])
∨ ¬[prod = 0] → ¬i.m.l.

Finally, the third part can be obtained by the difference set between $L_{1.0}$(common) and $L_{1.0}$(classic) = {[prod = 1]}.

[prod = 0] → common.

Combining these three parts, the classification rule for common is

(¬[nat = per] ∨ ¬[prod = 0])
∧ (¬([loc = occular] ∨ [loc = whole]) ∨ ¬[nat = per]
∨ ¬([his = subacute] ∨ [his = chronic]) ∨ ¬[prod = 0])
∧ [prod = 0] → common.

After its simplification, the rule is transformed into:

[nat = thr] ∧ ([loc = lateral] ∨ ¬([his = subacute] ∨ [his = chronic]))
∧ [prod = 0] → common,

whose accuracy is equal to 2/3.

It is notable that the second part ([jolt = 1] ∧ [M1 = 0] ∧ [M2 = 0]) is redundant in this case, compared with the first model. However, from the viewpoint of characterization of a target concept, it is a very important part.

6 Experimental Results 

The above rule induction algorithm was implemented in PRIMEROSE5.0 (Probabilistic Rule Induction Method based on Rough Sets Ver 5.0), and was applied to databases on differential diagnosis of headache, meningitis and cerebrovascular diseases (CVD), whose precise information is given in Table 4. In these experiments, $\delta_\alpha$ and $\delta_\kappa$ were set to 0.75 and 0.5, respectively. Also, the threshold for






Table 4. Information about Databases

Domain      Samples  Classes  Attributes
Headache    52119    45       147
CVD         7620     22       285
Meningitis  141      4        41



grouping is set to 0.8.¹ This system was compared with PRIMEROSE4.5 [11], PRIMEROSE [9], C4.5 [6], CN2 [1] and AQ15 [4] with respect to the following points: length of rules, similarities between induced rules and experts' rules, and performance of rules.

In this experiment, the length was measured by the number of attribute-value pairs used in an induced rule, and Jaccard's coefficient was adopted as a similarity measure [3]. Concerning the performance of rules, ten-fold cross-validation was applied to estimate classification accuracy.

Table 5 shows the experimental results, which suggest that PRIMEROSE5 outperforms PRIMEROSE4.5 (two-level) and the other four rule induction methods and induces rules very similar to medical experts' ones.

7 Discussion 

The readers may wonder why lengthy rules perform better than short rules, since lengthy rules suffer from overfitting to given data. One reason is that a decision attribute gives a partition of datasets: since the number of given classes ranges from 4 to 45, some classes have very low support due to the prevalence of the corresponding diseases. Thus, a disease with low frequency may not have short rules when conventional methods are used. However, since our method is based not on accuracy but on coverage, we can support diseases of low frequency. Another reason is that this method reflects the reasoning style of domain experts. One of the most important features of medical reasoning is that medical experts finally select one or two diagnostic candidates from many diseases, which is called the focusing mechanism. For example, in differential diagnosis of headache, experts choose one from about 60 diseases. The proposed method models induction of rules which incorporates this mechanism, and its experimental evaluation shows that induced rules correctly represent medical experts' rules.

This focusing mechanism is not specific to the medical domain. It can be applied in any domain in which a few diagnostic conclusions should be selected from many candidates. For example, fault diagnosis of complicated electronic devices should focus on which components will cause a functional problem: the more complicated the devices are, the more sophisticated the focusing mechanism that is required. In such domains, the proposed rule induction method will be useful to induce correct rules from datasets.

¹ These values are given by medical experts as good thresholds for rules in these three domains.
Table 5. Experimental Results

Domain      Method        Length      Similarity   Accuracy
Headache    PRIMEROSE5.0  8.8 ± 0.27  0.95 ± 0.08  95.2 ± 2.7%
            PRIMEROSE4.5  7.3 ± 0.35  0.74 ± 0.05  88.3 ± 3.6%
            Experts       9.1 ± 0.33  1.00 ± 0.00  98.0 ± 1.9%
            PRIMEROSE     5.3 ± 0.35  0.54 ± 0.05  88.3 ± 3.6%
            C4.5          4.9 ± 0.39  0.53 ± 0.10  85.8 ± 1.9%
            CN2           4.8 ± 0.34  0.51 ± 0.08  87.0 ± 3.1%
            AQ15          4.7 ± 0.35  0.51 ± 0.09  86.2 ± 2.9%
Meningitis  PRIMEROSE5.0  2.6 ± 0.19  0.91 ± 0.08  82.0 ± 3.7%
            PRIMEROSE4.5  2.8 ± 0.45  0.72 ± 0.25  81.1 ± 2.5%
            Experts       3.1 ± 0.32  1.00 ± 0.00  85.0 ± 1.9%
            PRIMEROSE     1.8 ± 0.45  0.64 ± 0.25  72.1 ± 2.5%
            C4.5          1.9 ± 0.47  0.63 ± 0.20  73.8 ± 2.3%
            CN2           1.8 ± 0.54  0.62 ± 0.36  75.0 ± 3.5%
            AQ15          1.7 ± 0.44  0.65 ± 0.19  74.7 ± 3.3%
CVD         PRIMEROSE5.0  7.6 ± 0.37  0.89 ± 0.05  74.3 ± 3.2%
            PRIMEROSE4.5  5.9 ± 0.35  0.71 ± 0.05  72.3 ± 3.1%
            Experts       8.5 ± 0.43  1.00 ± 0.00  82.9 ± 2.8%
            PRIMEROSE     4.3 ± 0.35  0.69 ± 0.05  74.3 ± 3.1%
            C4.5          4.0 ± 0.49  0.65 ± 0.09  69.7 ± 2.9%
            CN2           4.1 ± 0.44  0.64 ± 0.10  68.7 ± 3.4%
            AQ15          4.2 ± 0.47  0.68 ± 0.08  68.9 ± 2.3%



8 Conclusion 

In this paper, the characteristics of experts' rules are closely examined, and the empirical results suggest that grouping of diseases is very important to realize automated acquisition of medical knowledge from clinical databases. Thus, we focus on the role of coverage in focusing mechanisms and propose an algorithm for grouping of diseases by using this measure. The above example shows that rule induction with this grouping generates rules which are similar to medical experts' rules, which suggests that our proposed method should capture medical experts' reasoning. This research is a preliminary study on a rule induction method with grouping, and it will be a basis for future work to compare the proposed method with other rule induction methods by using real-world datasets.
Abstract. In this paper, we present an investigation into the combination of
four different classification methods for text categorization using Dempster's 
rule of combination. These methods include the Support Vector Machine, kNN 
(nearest neighbours), kNN model-based approach (kNNM), and Rocchio 
methods. We first present an approach for effectively combining the different 
classification methods. We then apply these methods to a benchmark data 
collection of 20-newsgroup, individually and in combination. Our experimental 
results show that the performance of the best combination of the different 
classifiers on the 10 groups of the benchmark data can achieve 91.07% 
classification accuracy, which is 2.68% better than that of the best individual 
method, SVM, on average. 



1 Introduction 

The benefits of combining multiple classifiers based on different classification methods for the same problem have been assessed in various fields of pattern recognition, including character recognition [1], speech recognition [2], and text categorization [3]. The idea of combining classifiers is motivated by the observation of their complementary characteristics. It is desirable to take advantage of the strengths of individual classifiers and to avoid their weaknesses, resulting in the improvement of classification accuracy. Our work is inspired by an idea from common sense and also from artificial intelligence research, i.e. a decision made on the basis of multiple pieces of evidence should be more effective than one based on a single piece of evidence. A classification problem is seen as a process of inferences about class concepts from concrete examples [4]. The inference process can be modelled as forward reasoning under uncertainty, as in production rule systems, which allows prior knowledge (prior performance assessments of classifiers) to be incorporated and multiple pieces of evidence from the classifiers to be combined to achieve precise classification decisions.

In the context of combining multiple classifiers for text categorization, a number of researchers have shown that combining different classifiers can improve classification accuracy. In [5], Sebastiani provides a state-of-the-art review on text categorization, including this aspect. It identifies four combination functions or rules used for combining multiple classifiers in text categorization, including majority voting (MV), weighted linear combination (WLC), dynamic classifier selection (DCS), and adaptive classifier combination (ACC). MV is the simplest approach, where the classification decision on each class is made on the basis of the majority of classifiers being in favour of that class for a given input [6]. WLC is a well-known approach, where the weights (scores) of classes are determined by individual classifiers and summed together. The summed weights are then used for class assignment to each test document [3, 7]. DCS is based on the local accuracy in the neighbourhoods of test documents; the classifier with the highest local accuracy will be dynamically selected for classifying the documents. Computing the local accuracy for each test document is similar to finding the neighbourhood of the document in a k-nearest neighbour (kNN) method [6]. ACC is an intermediate approach between WLC and DCS, where instead of selecting the best classifier with the highest local accuracy, ACC sums all the classifiers together to classify test documents [6].

The above work provides a context and motivation for our study. The contribution 
of the paper is twofold. First of all, we propose to use Dempster's rule of combination 
for combining multiple classifiers for text categorization. It provides a theoretical 
underpinning for achieving precise decisions through aggregating the majority voting 
principle and the belief degrees of decisions. Furthermore, our work is focused on 
combining the outputs from different classifiers at the measurement level and 
incorporating the prior performance (prior knowledge) of each classifier into the 
classification decision process. In contrast, the previous work described by Xu et al. 
[1], was aimed at combining the outputs from classifiers at the label level, and the 
prior performance was used indirectly for defining the mass functions, instead of 
directly for decision making. A number of other empirical studies have been 
conducted in choosing an appropriate structure for representing outputs from 
classifiers. The triplet structure used in this work is shown to give a better result than 
the dichotomous structure used in [1]. 

The paper is organized as follows. Section 1 presents a general way of representing 
outputs from classification methods and an overview of the Dempster-Shafer theory 
of evidence. Section 2 describes the definitions of evidence and mass functions. Then 
after describing a proposed technique for combining multiple pieces of evidence and 
decision rules for determining the final categories in Section 3, we present the 
evaluation measures and experimental results in Section 4. Finally a summary is given 
in Section 5. 



2 Background 

In this section, we start by introducing a general form that inductive learning methods 
may have, particularly, in the context of text categorization, and then review the 
Dempster-Shafer theory of evidence. 

2.1 A General Output Form of Classification Methods 

Generally speaking, a learning algorithm for text categorization aims at learning a classifier or mapping which enables the assignment of documents to the predefined categories. Formally, let $D = \{d_1, d_2, \ldots, d_{|D|}\}$ be a training set of documents, where each document $d_i$ is represented by a weighted vector $(w_{i1}, \ldots, w_{in})$, and $C = \{c_1, c_2, \ldots, c_{|C|}\}$ be a set of categories. Then the task of assigning predefined categories to documents can be regarded as a mapping which assigns a boolean value to each pair $(d, c) \in D \times C$. If a true value is assigned to $(d, c)$, that means a decision is made to include the document d under the category c, whereas a false value indicates the document d is not under the category c.

The task of learning for text categorization is to construct an approximation to an unknown function $\varphi : D \times C \rightarrow \{True, False\}$, where $\varphi$ is often called a classifier. However, given a test document $d_i$, such a mapping cannot guarantee that an assignment of the categories to the document is either true or false; instead it is a set of numeric values, denoted by $S = \{s_1, s_2, \ldots, s_{|C|}\}$, which represent the relevance of the document to the list of categories in the form of similarity scores or probabilities, i.e. $\varphi(d_i) = \{s_1, s_2, \ldots, s_{|C|}\}$, where the greater the score of the category, the greater the possibility of the document being under the corresponding category. It is, therefore, necessary to develop a decision rule to determine a final category of the document on the basis of these scores or probabilities.

2.2 Overview of the Dempster-Shafer (D-S) Theory of Evidence 

Consider a number of exhaustive and mutually exclusive propositions $h_i$, $i = 1, \ldots, m$, which form a universal set $\Theta$, called the frame of discernment. For any subset $H_i = \{h_{i1}, \ldots, h_{ik}\} \subseteq \Theta$, $h_{ir}$ ($0 < r \leq k$) represents a proposition, called a focal element, and when $H_i$ is a one-element subset, i.e. $H_i = \{h_i\}$, it is called a singleton. All the subsets of $\Theta$ constitute a powerset $2^\Theta$, i.e. for any subset $H \subseteq \Theta$, $H \in 2^\Theta$. The D-S theory uses a numeric value in a range [0, 1] to represent the strength of some evidence supporting a proposition $H \subseteq \Theta$ based on a given evidence, denoted by $m(H)$, called the mass function, and uses a sum of strength for all subsets of H to indicate a belief degree to the proposition H on the basis of the same evidence, denoted by $bel(H)$, often called belief function. The formal definitions for these functions are given below [8]:

Definition 1 Let $\Theta$ be a frame of discernment. Given a subset $H \subseteq \Theta$, a mass function is defined as a mapping $m: 2^\Theta \rightarrow [0, 1]$ that satisfies the following conditions:

1) $m(\emptyset) = 0$

2) $\sum_{H \subseteq \Theta} m(H) = 1$

Definition 2 Let $\Theta$ be a frame of discernment and m be a mass function on $\Theta$. The belief of a subset $H \subseteq \Theta$ is defined as

$bel(H) = \sum_{B \subseteq H} m(B)$ (1)

and satisfies the following conditions:

1) $bel(\emptyset) = 0$,

2) $bel(\Theta) = 1$

When H is a singleton, $m(H) = bel(H)$. With the belief function above, a plausibility function is defined as

$pls(H) = 1 - bel(\bar{H})$ (2)

It can be seen that a belief function gathers all of the support that a subset H gets from all of the mass functions of its subsets, whereas a plausibility function is the difference between 1 and all of the support of its complement subsets.

Definition 3 Given $H \subseteq \Theta$, let $bel(H)$ be a belief function and $pls(H)$ be a plausibility function. The ignorance on H is defined as:

$ignorance(H) = pls(H) - bel(H)$ (5)
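
Definitions 2 and 3 translate almost literally into code when a mass function is stored as a dictionary from focal sets to masses. The sketch below is illustrative only: the frame, the mass values and the function names are assumptions chosen so that the arithmetic is exact in floating point.

```python
def bel(mass, h):
    """Belief: total mass of all focal sets contained in h."""
    return sum(value for focal, value in mass.items() if focal <= h)


def pls(mass, h, frame):
    """Plausibility: 1 minus the belief of the complement of h."""
    return 1 - bel(mass, frame - h)


def ignorance(mass, h, frame):
    """Ignorance: the gap between plausibility and belief."""
    return pls(mass, h, frame) - bel(mass, h)


# Hypothetical frame of discernment and mass function (masses sum to 1).
FRAME = frozenset({"c1", "c2", "c3"})
MASS = {
    frozenset({"c1"}): 0.5,
    frozenset({"c1", "c2"}): 0.25,
    FRAME: 0.25,
}

h = frozenset({"c1"})
print(bel(MASS, h), pls(MASS, h, FRAME), ignorance(MASS, h, FRAME))  # 0.5 1 0.5
```

For the singleton {c1} the belief equals its own mass, while every other focal set still intersects it, so the plausibility stays at 1.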



Definition 4 Let $m_1$ and $m_2$ be two mass functions on the frame of discernment $\Theta$. For any subset $H \subseteq \Theta$, the orthogonal sum of the two mass functions on H is defined as:

$(m_1 \oplus m_2)(H) = \frac{\sum_{X \cap Y = H} m_1(X) m_2(Y)}{1 - \sum_{X \cap Y = \emptyset} m_1(X) m_2(Y)}$ (6)

This formula is also called Dempster's rule of combination. It allows two mass functions to be combined into a third mass function, pooling pieces of evidence to support propositions of interest.
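
A small worked instance may help to read formula (6). The sketch below implements the orthogonal sum for mass functions stored as dictionaries from focal sets to masses; the function name `combine`, the frame and the two example mass functions are assumptions made for illustration, and exact fractions are used so that the result can be checked by hand.

```python
from fractions import Fraction
from itertools import product


def combine(m1, m2):
    """Orthogonal sum of two mass functions given as {frozenset: mass} dicts."""
    joint, conflict = {}, Fraction(0)
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        common = x & y
        if common:
            joint[common] = joint.get(common, Fraction(0)) + mx * my
        else:
            conflict += mx * my  # mass falling on the empty set
    if conflict == 1:
        raise ValueError("total conflict: the orthogonal sum is undefined")
    return {h: value / (1 - conflict) for h, value in joint.items()}


# Two hypothetical pieces of evidence over the same frame.
FRAME = frozenset({"c1", "c2", "c3"})
m1 = {frozenset({"c1"}): Fraction(3, 5), FRAME: Fraction(2, 5)}
m2 = {frozenset({"c2"}): Fraction(1, 2), FRAME: Fraction(1, 2)}

m12 = combine(m1, m2)
for focal, value in m12.items():
    print(sorted(focal), value)
```

In this instance the conflicting mass is 3/10, so the remaining products are rescaled by 7/10, giving 3/7 for {c1}, 2/7 for {c2} and 2/7 for the whole frame.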



3 Proposed Combination Technique 

Having introduced a general representation of output information yielded by 
classifiers, we turn our attention to define evidence based on the output information 
and devise a process of combining multiple pieces of evidence derived from 
classifiers. We first look at how to estimate the degrees of belief of the evidence 
obtained from classifiers and then give the definitions of mass and belief functions. 

3.1 Define Mass Function 

As mentioned in Section 1, the output scales associated with our four classification methods are different. The output score of the SVM adapted in this work is an approximation of probability, and the others are similarity-based scores. Notice that these scores measure the degrees of closeness between documents and the categories, i.e. in a sense, the closer the documents, the greater the scores. It is not important which method is used for expressing the scores as long as their order remains unchanged. Bearing this in mind, we propose to use the following formula to normalize these scores $x \in S = \{s_1, s_2, \ldots, s_{|C|}\}$:

$F(x) = minval + (x - minval) \cdot \delta$ (7)

where $\delta$ is a coefficient, whose values will be chosen depending on the different classifiers, and $minval = \min\{s_1, s_2, \ldots, s_{|C|}\}$. The normalized scores are used to define mass and belief functions as given by formula (8) below.
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Let 9 be a elassifier, C = {ci, C 2 , . . C|c|} be a list of eategories, and d be any test 
doeument, then an assignment of eategories to d is denoted by (p(t/) = S 2 , ^|c|}- 

For eonvenienee of diseussion, we define a funetion 03, 03(ci) = for all cigC. 
Alternatively, (p((i) is written as {p(t/) = {03(ci), 03(c2), ..., 05(C|c|)} whieh is treated as a 
general form of the output information at the measurement level, where 03(Ci) is a 
normalized seore by formula (7) above. Now we give a formal definition of mass 
funetion in this eontext. 

Definition 5 Let $C$ be a frame of discernment, where each category $c_i \in C$ is a
proposition that the document $d$ is of category $c_i$, and let $\varphi(d)$ be a piece of evidence that
indicates a possibility that the document comes from each category $c_i \in C$. Then a mass
function is defined as a mapping $m: 2^{C} \to [0,1]$, i.e. mapping a basic probability
assignment (bpa) to $c_i \in C$ for $1 \le i \le |C|$ as follows:

$$m(\{c_i\}) = \frac{\varpi(c_i)}{\sum_{j=1}^{|C|} \varpi(c_j)}, \quad \text{where } 1 \le i \le |C| \qquad (8)$$

This expresses the degrees of belief in the propositions of each category to which a
given document could belong. It is easy to see that the mass function defined
in this way satisfies the conditions given in Definition 1.

With formula (8), the expression of the output information $\varphi(d)$ is rewritten as $\varphi(d)
= \{m(\{c_1\}), m(\{c_2\}), \ldots, m(\{c_{|C|}\})\}$. Therefore two or more outputs derived from
different classifiers as pieces of evidence can then be combined by using formula (6)
to obtain a combined output as a new piece of evidence, forming a combined
classifier for classification tasks. In our study, we have performed a number of
empirical evaluations of the performance of the combined classifiers. We find that the
use of three subsets of $C$, breaking $\varphi(d)$ down into three parts, can achieve better
results. More details about this aspect can be found in [9, 10].
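
A minimal Python sketch of formulas (7) and (8) follows. The scores, the category names and the value of the coefficient (called `delta` here) are hypothetical; the sketch also assumes that all scores are positive so that the normalizing sum is non-zero.

```python
def normalize(scores, delta):
    """Formula (7): rescale each score relative to the minimum score."""
    minval = min(scores)
    return [minval + (s - minval) * delta for s in scores]


def mass_from_scores(categories, scores, delta=1.0):
    """Formula (8): turn normalized scores into masses on singleton categories."""
    normalized = normalize(scores, delta)
    total = sum(normalized)
    return {frozenset({c}): w / total for c, w in zip(categories, normalized)}


# Hypothetical classifier output for one document over three categories.
categories = ["c1", "c2", "c3"]
scores = [0.2, 0.5, 0.3]

m = mass_from_scores(categories, scores, delta=2.0)
# Normalized scores are 0.2, 0.8 and 0.4, so the masses are 1/7, 4/7 and 2/7.
print(m)
```

Because formula (7) is increasing in the score for a positive coefficient, the ranking of the categories is unchanged by the normalization.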



4 Combination Method 

Let us describe our method for combining multiple classifiers in more detail. Assume
we are given a group of classification algorithms and a training data set, each of
which can be used to generate one or more classifiers based on different training
methods. For example, using ten-fold cross-validation, each of the methods can generate
ten classifiers. The combination task of multiple classifiers, in this context, is to
summarize the aggregated results output by the classifiers, which are either derived
from one learning method or from distinct learning methods. We focus our interest on the
latter.

Let $\Psi^k$ be a group of $K$ learning methods and $\varphi_1^k, \varphi_2^k, \ldots, \varphi_n^k$ be a group of associated
classifiers, where $1 \le k \le K$ and $n$ is a parameter that is related to validation methods.
Then each of the classifiers assigns an input document $d$ to a triplet $Y$, denoted by
$\varphi_i^k(d) = Y_i^k$ where $1 \le i \le n$. The results output by multiple classifiers are represented as
a matrix:

$$\begin{pmatrix} Y_1^1 & Y_2^1 & \cdots & Y_n^1 \\ Y_1^2 & Y_2^2 & \cdots & Y_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ Y_1^K & Y_2^K & \cdots & Y_n^K \end{pmatrix} \qquad (9)$$



where each row $k$ corresponds to one of the learning methods, and each column $i$
corresponds to one of the classifiers; $Y_i^k$ is the result yielded by the classifier $i$ which
is derived from the classification method $k$. For example, in the present study, the
number of classification methods is $K = 4$, and for the 10-fold cross-validation method,
$n = 10$ classifiers will be generated by each of the classification methods, denoted by
$\{\varphi_1^k, \varphi_2^k, \ldots, \varphi_{10}^k\}$. Thus the combination task based on this matrix is made on the
columns, i.e. for each column, all the rows will be combined using formula (10),
thereby producing a new mass distribution over all the categories that represents the
consensus of the assignments of the multiple classifiers to test documents. The final
classification decision will be made by using the decision rule of formula (12).

$$bel(A) = m^1 \oplus m^2 \oplus \cdots \oplus m^K = [\ldots[[m^1 \oplus m^2] \oplus m^3] \oplus \cdots \oplus m^K] \qquad (10)$$
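
The column-wise combination can be sketched as a left fold of the orthogonal sum over the K mass functions in a column. The sketch below makes a simplifying assumption that is not the triplet structure used in this work: every mass function assigns mass to singleton categories only, in which case Dempster's rule reduces to a normalized product. The four mass functions are hypothetical.

```python
from functools import reduce


def combine_singletons(m1, m2):
    """Dempster's rule when all focal elements are singletons of one frame."""
    joint = {c: m1[c] * m2[c] for c in m1}
    agreement = sum(joint.values())  # equals 1 minus the conflict
    if agreement == 0.0:
        raise ValueError("total conflict: the orthogonal sum is undefined")
    return {c: v / agreement for c, v in joint.items()}


def combine_column(masses):
    """Nested combination [...[[m1 + m2] + m3] + ... + mK] over one column."""
    return reduce(combine_singletons, masses)


# Hypothetical outputs of K = 4 classifiers for one test document.
column = [
    {"c1": 0.6, "c2": 0.3, "c3": 0.1},
    {"c1": 0.5, "c2": 0.4, "c3": 0.1},
    {"c1": 0.2, "c2": 0.7, "c3": 0.1},
    {"c1": 0.5, "c2": 0.3, "c3": 0.2},
]

consensus = combine_column(column)
print(consensus)
```

Since the orthogonal sum is associative and commutative, the order in which the rows are folded does not affect the resulting consensus.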



4.1 Conditions for Applying Dempster's Rule of Combination

In the above section, we have illustrated the means for combining pieces of evidence
derived from different evidence sources. However, such operations have to be
performed under certain conditions. In [11], Shi indicated that given two pieces of
evidence $e_1$ and $e_2$, and two sets of propositions $P_1$ and $P_2$ ($P_1 \cap P_2 = \emptyset$), the prerequisite
for applying Dempster's rule of combination depends on three major facts, viz.
that $e_1$ and $e_2$ are independent of each other; that they support a single set of the
propositions, $P_1$ or $P_2$; and that the propositions are mutually exclusive. In the
scenario of combining multiple classifiers, evidence sources are defined on the basis
of outputs obtained from the different classifiers, and the set of propositions is composed
of all the predefined categories. As a result, both evidence and propositions comply
with the above constraints, and the use of Dempster's rule of combination is justified.

4.2 Decision Making about Classification

Majority voting is a simple decision rule that is widely used in classification problems.
In combining the results of multiple classifiers, each result output from an individual
classifier can be treated as a vote, and the decision rule on the basis of votes received
from classifiers is given by the formula below:

$$D(x) = c \quad \text{if} \quad \max_{c \in C} \sum_{k=1}^{K} \varphi^{k}(x) \ge \alpha \cdot K \qquad (11)$$

where $D$ is a combined classifier, $x$ is a test document, $K$ is the number of classifiers, $1 \le
k \le K$, and $0 < \alpha \le 1$. As a threshold, we only require that the maximal vote of the finally
assigned category exceed a certain value, such as $\alpha = 0.5$. There may exist a case
where more than one category receives the maximal vote, or where the maximal vote
is not considerably larger than the second largest vote, or the two are equal.
In such cases, to ensure that only one category is selected as the final
assignment of a given document, the top choice is selected as the final category, and
the other categories must be left out. In addition, there may also be a case where an
opponent receives a large vote. In this situation, a decision made as above may not be
reliable because it considers only the largest number of votes as the consensus
decision without taking account of the degrees of belief in the decisions made by the
classifiers.

A decision rule defined on the basis of the D-S theory of evidence is different from
the majority voting principle. D-S makes use of evidence accumulated from multiple
classifiers in making a decision. It not only considers the majority agreement on the
decisions received from the classifiers, but also incorporates the degrees of belief
associated with these decisions into the decision process. So it provides an effective
means to reconcile decisions made by multiple classifiers.

With all belief values of the categories to which documents could belong obtained by
using Equation (10), we can define a decision rule for determining a final category in
general cases as follows:

$$D(x) = A \quad \text{if} \quad bel(A) = \max_{A \in C} bel(A) \qquad (12)$$

However, there is an extreme case where, in the combined result of two triplets, the
first decision made is different from the second one, but the difference between their
degrees of confidence is very small. This leads to a conflict situation. To cope with
this, we devised a more sophisticated decision rule. More details can be found in [10].
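
The two decision rules discussed in this section can be contrasted with a short Python sketch. It is an illustration under stated assumptions: votes are given as a list of category labels, beliefs as a dictionary, the threshold test uses "greater than or equal to", and all example values are hypothetical. The more sophisticated rule for conflict situations described in [10] is not reproduced here.

```python
from collections import Counter


def majority_vote(votes, alpha=0.5):
    """Return the top-voted category if its votes reach alpha * K, else None."""
    category, count = Counter(votes).most_common(1)[0]
    return category if count >= alpha * len(votes) else None


def max_belief(beliefs):
    """Return the category with the largest combined belief."""
    return max(beliefs, key=beliefs.get)


# Hypothetical votes of K = 4 classifiers and hypothetical combined beliefs.
votes = ["c1", "c2", "c1", "c3"]
beliefs = {"c1": 0.40, "c2": 0.55, "c3": 0.05}

print(majority_vote(votes))   # c1 receives 2 of 4 votes
print(max_belief(beliefs))    # c2 has the largest belief
```

In this hypothetical case the two rules disagree: the vote count ignores how strongly each classifier supports its choice, whereas the belief-based rule takes it into account.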

5 Evaluation 

There are a number of methods to evaluate the performance of a machine learning
algorithm. Among these methods, one widely used in information retrieval and text
categorization is a pair of measures, called Precision and Recall, and denoted by p and
r respectively. The two measures are defined on the basis of a contingency table of
classes as illustrated in Table 1. Precision is defined as the ratio of the number of
correctly predicted categories to the number of predicted true categories, and Recall is
defined as the ratio of the number of correctly predicted categories to the number of
true categories. The two measures are given by formulas (13) and (14), respectively.



Table 1. A two-way contingency cardinality table

$c_i$                       true category    false category
predicted true category     a                b
predicted false category    c                d

$$p = \frac{a}{a + b} \qquad (13)$$

$$r = \frac{a}{a + c} \qquad (14)$$

On the basis of Precision and Recall, to evaluate performance across multiple
categories, three other conventional methods are further defined, namely the $F_1$ measure,
micro-average and macro-average. The $F_1$ measure, initially introduced in [12],
combines Precision and Recall as a harmonic mean of the two measures in the
following form:

$$F_1(p, r) = \frac{2pr}{p + r} \qquad (15)$$

To obtain average performance on each individual category, the micro-averaged $F_1$
can be calculated using formula (15). The micro-averaged $F_1$ scores are computed by
first creating a global contingency table whose cell values are the sum of the
corresponding cells in each category contingency table, and then using this global
contingency table to compute micro-averaged $F_1$ scores. Given a set of test documents
with $m$ categories, we denote by $F_1(c_i)$ the $F_1$ value of the category $c_i$. The micro-averaged
$F_1$ is given by formula (16). This measure will be used in cross-method
comparisons in this work.

$$\text{micro-averaged } F_1 = \frac{\sum_{i=1}^{m} F_1(c_i)}{m} \qquad (16)$$

The macro-averaged $F_1$ scores are computed by first computing the $F_1$ scores of
each category contingency table and then averaging these scores to compute a global
average. This measure is not used in this work. There is an important distinction
between the two measures. The micro-averaged score gives equal weight to every test
document, and is therefore a per-document average. A macro-averaged score
gives equal weight to every category, regardless of its frequency, and is therefore
a per-category average.
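
The following Python sketch illustrates the two averaging schemes as described in the text: pooling the contingency tables into a global table, versus averaging the per-category scores. The contingency counts are hypothetical and use the cell names a, b, c and d of Table 1.

```python
def precision(a, b):
    """Formula (13): correctly predicted / all predicted true."""
    return a / (a + b)


def recall(a, c):
    """Formula (14): correctly predicted / all actually true."""
    return a / (a + c)


def f1(p, r):
    """Formula (15): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def f1_from_table(a, b, c):
    return f1(precision(a, b), recall(a, c))


# Hypothetical per-category contingency tables (a, b, c, d).
tables = {"c1": (90, 10, 10, 890), "c2": (10, 10, 30, 950)}

# Pooled table: sum the corresponding cells over the categories first.
a_sum = sum(t[0] for t in tables.values())
b_sum = sum(t[1] for t in tables.values())
c_sum = sum(t[2] for t in tables.values())
pooled_f1 = f1_from_table(a_sum, b_sum, c_sum)

# Per-category average: compute F1 per category, then take the mean.
per_category_f1 = sum(f1_from_table(*t[:3]) for t in tables.values()) / len(tables)

print(pooled_f1, per_category_f1)
```

With these counts the frequent category c1 dominates the pooled score, while the rare and poorly classified category c2 pulls the per-category average down.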

5.1 Newsgroup Data 

For our experiments, we have chosen a benchmark dataset often referred to as
20-newsgroup. It consists of 20 categories with 1,000 documents (Usenet
articles) each, so that the dataset contains 20,000 documents in total. Except for a small
fraction of the articles (4%), each article belongs to exactly one category [13].

In this work, we use 10 categories of documents. The documents within each 
category are further split into 5 groups based on the number of documents, i.e. 200 
documents, 400, 600, 800, and 1000 documents. Table 2 gives the configuration of 
ten categories of documents. 



Table 2. Configuration of ten categories of documents (C1: alt.atheism; C2: comp.graphics;
C3: comp.os.ms-windows.misc; C4: comp.sys.ibm.pc.hardware; C5: comp.sys.mac.hardware; C6:
comp.windows.x; C7: misc.forsale; C8: rec.autos; C9: rec.motorcycles; C10: rec.sport.baseball)

Group   C1   C2   C3   C4   C5   C6   C7   C8   C9   C10   Total
200     √    √    √    √    √    √    √    √    √    √     20,000
400     √    √    √    √    √    √    √    √    √    √     40,000
600     √    √    √    √    √    √    √    √    √    √     60,000
800     √    √    √    √    √    √    √    √    √    √     80,000
1000    √    √    √    √    √    √    √    √    √    √     100,000

5.2 Performance Analysis

We use information gain as a measure for feature selection at the pre-processing stage for
each classifier, and this stage also involves the task of removing function words
before weighting terms by using tfidf (term frequency and inverse document
frequency) [14]. Based on the configuration of the above data set, we select 5300
features from each group on average for training and testing.

In our experiments, we use the ten-fold cross-validation method to evaluate the
effectiveness of each classifier generated by the four different classification methods
of SVM [15], kNNM [16], kNN [17] and Rocchio [13], and their various
combinations. The evaluation procedure first partitions each group of data (i.e. 200,
400, 600, 800 and 1000) into 10 disjoint subsets of equal size, and then trains and
tests each classification method 10 times, using each of the 10 subsets in turn as the test
set, and using all the remaining data as the training set. In this way, each classification
method will generate 10 classifiers, and these classifiers are correspondingly tested 10
times each.
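
The partitioning step of this procedure can be sketched in Python as follows. This is a minimal illustration rather than the experimental code: documents are represented by integer identifiers, and the fold assignment by striding (without shuffling) is an assumption made for brevity.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs, using each of k disjoint folds once as test set."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical group of 200 documents, identified by their index.
documents = list(range(200))
splits = list(k_fold_splits(documents, k=10))

for train, test in splits:
    # Each round trains on 180 documents and tests on the remaining 20.
    assert len(train) == 180 and len(test) == 20
print(len(splits))
```

Each of the 10 rounds would train one classifier per classification method, which yields the 10 classifiers per method mentioned above.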

With respect to each classifier of a given classification method, its testing result is
put into a contingency table. For the ten-fold cross-validation method, ten contingency
tables will be generated and are then measured by micro-averaged $F_1$, as the
performance of the classification method, i.e. classification accuracy (CA). The
performance on each group of data is a mean value averaged over the 10 categories
of documents, referred to as average category performance (AC), whereas the
performance on each category is a mean value averaged over the five groups of
data, referred to as average group performance (AG). Notice that some classifiers
may not consistently perform well on the different categories or groups of documents.
For example, SVM is not always the best on all the groups of data, e.g. the
performance of kNNM is better than that of SVM on the “200” group of documents. Thus
the performance of one classification method, in this context, is an average of the
average category performance and the average group performance, which can be
calculated by formula (17) below:

$$CA = \frac{\sum_{i=1}^{|AG|} AC_i}{|AG|} \qquad (17)$$

According to their performance, the individual classifiers are sorted, denoted by $E^1$,
$E^2, \ldots, E^n$, where $E^1$ is the best classifier, and $E^n$ is the worst classifier. After that the
outputs produced by different classifiers will be combined to obtain the new
combined results, which will be measured by micro-averaged $F_1$ in the same way as
above as the performance of the combined classifiers. We are more interested in the
combinations of the best classifier, the worst classifiers, and the combined classifiers
averaged over the different categories and groups of data. The experimental results
are presented in Figures 1, 2 and 3.

Fig. 1. Difference between the performance of the best individual classifier and that of the best
combined classifier




Numbers of documents in 10 categories 



Fig. 2. Difference between the performance of the worst classifier and that of the worst 
combined classifier 



6 Conclusion 

In this work, we have developed a set of methods for combining multiple classifiers
using Dempster's rule of combination. Various experiments have been carried out on
the 20-newsgroup benchmark data, individually and in combination. We also study
the different combinations of the best classifier, the worst classifier, and the combined
classifiers averaged over the different categories and groups of data. Based on our
experimental results, we can make the following observations.

Fig. 3. The performance difference of combined classifiers against the four classifiers of SVM,
kNNM, kNN and Rocchio on average



• Comparison between the best individual classifier and the best combined
classifier on five different groups of data. It is observed that the performance
of the best combined classifier increases, but when the number of documents
increases, its classification accuracy decreases. On average, the classification
accuracy of the best combined classifier is 2.58% better than that of the
best individual classifier (Fig. 1).

• Comparison between the worst classifier and the worst combined classifier 
over five groups of data. When the number of documents increases, the 
performance of the combined classifier increases, but not monotonically. On 
average, the performance of the worst combined classifier increases 8.68 % 
compared to the worst individual classifier (Figure 2). 

• Comparison between the average category performance of the combined 
classifiers and the individual classifiers on average. It can be observed that 
when the number of documents increases, the average category performance 
of the combined classifiers is 3.4% better than that of individual classifiers 
(Figure 3). 

To our knowledge, this work is the first attempt at using Dempster's rule of 
combination for integrating different classification methods for text categorization. 
The experimental results have shown the promise of this approach. To consolidate 
this work, more comprehensive experiments on the benchmark data will be carried 
out.

Abstract. We investigate the accuracy of parameters in the Logic Scoring of
Preference (LSP) criterion functions for system evaluation. Main parameters 
are weights and conjunction/disjunction degrees (andness/orness). Weights 
reflect the level of relative importance of various decision variables. 
Andness/orness describes a desired level of simultaneity/replaceability in 
satisfying component criteria. These parameters are assessed by one or more 
professional evaluators and their values differ from (usually unknown) 
optimum values. In this paper we identify all potential sources of errors in LSP 
criterion functions. Our goal is to investigate the distribution of errors, their 
average values, and the quality of individual evaluators and their teams. 



1 Introduction 

Complex criteria for system evaluation are functions used for computing a global 
degree of satisfaction of a set of requirements. We assume that such criteria are 
quantitative models of existing expert knowledge. The fundamental goal of the 
criterion building process is to extract all existing expert knowledge and include it in 
the criterion function. More precisely, the expert knowledge is used to determine the 
structure of a criterion function and the best values of its parameters. Of course, there 
are always limitations of expert knowledge that cause errors in the organization of 
criterion functions. In this paper we analyze origins of these errors. 

Suppose that we have n requirements and use them to evaluate n system
performance indicators $x_1, x_2, \ldots, x_n$. For example, if the evaluated system is a
computer, then $x_1$ may denote the capacity of its memory. For each competitive
system, we assume that performance indicators can be precisely defined and in the
majority of cases also accurately measured. The measured values are then used for
computing a global degree of satisfaction of all requirements. This indicator is called
the global preference and denoted $E_0$. We interpret the global preference as a degree
of truth in the statement that the system completely satisfies all requirements. So,
$E_0 \in [0,1]$. The global preference is computed using a criterion function L, as
follows:

$$E_0 = L(x_1, x_2, \ldots, x_n; p_1, p_2, \ldots, p_m), \quad x_i \in \mathbb{R}, \quad p_j \in \mathbb{R}$$

The set of m parameters $p_1, \ldots, p_m$ contains weights, andness/orness, and other values
that are estimated by evaluators. The accuracy of $E_0$ depends on the accuracy of
parameters $p_1, \ldots, p_m$. Since $E_0$ reflects the global satisfaction of all requirements,
and these requirements are subjectively specified by evaluators, it follows that the
optimum values of parameters $p_1, \ldots, p_m$ are obtained as mean values of opinions of a
large population of qualified evaluators.

Practical evaluation of complex systems requires expert knowledge of decision
methods, plus domain expertise (detailed knowledge of the evaluated systems).
System evaluation teams usually consist of a single evaluator (decision modeling
expert) and a few domain experts. Consequently, the number of experts in evaluation
teams is regularly rather small. This causes parameter assessment errors that can yield
errors in the global preference $E_0$ and errors in the ranking of competitive systems.

The problem of robustness of decision models has a long history. The primary
interest in this area was in simple linear weighted scoring models:

$$E_0 = \sum_{i=1}^{n} W_i\, g_i(x_i), \quad W_i > 0, \quad \sum_{i=1}^{n} W_i = 1, \quad g_i(x_i) \in [0,1], \quad i = 1, \ldots, n$$

Functions $g_i(x_i)$, $i = 1, \ldots, n$, are elementary criteria (equivalent to value functions in
MCDM [2]). They represent individual requirements used to evaluate performance
indicators $x_1, \ldots, x_n$. Parameters of elementary criteria are one of the (minor) sources of
evaluation errors. The weights reflect the relative importance of inputs.

Errors in weight assessment, $\Delta W_i = W_i - W_i^{*}$, $i = 1, \ldots, n$, cause errors in $E_0$.
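
The effect of weight errors in a linear weighted scoring model can be illustrated with a small Python sketch. All numbers are hypothetical: `prefs` stands for the values of the elementary criteria, `exact` for the (normally unknown) optimum weights, and `assessed` for the weights given by an evaluator.

```python
def linear_score(weights, prefs):
    """Linear weighted scoring: sum of weight times elementary preference."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * e for w, e in zip(weights, prefs))


# Hypothetical elementary preferences and two weight vectors.
prefs = [0.9, 0.6, 0.3]
exact = [0.5, 0.3, 0.2]
assessed = [0.6, 0.2, 0.2]

# Error in the global preference caused by the weight errors.
error = linear_score(assessed, prefs) - linear_score(exact, prefs)

# For a linear model the same error is the sum of weight error times preference.
check = sum((wa - we) * e for wa, we, e in zip(assessed, exact, prefs))
print(error, check)
```

Because the weight errors sum to zero, the resulting error depends only on the differences between the elementary preferences, which is consistent with the observation that correlated preferences reduce the effect of weight errors.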

The problem of errors in weight estimation has been analyzed in the literature. There
are two general approaches to weight generation: (1) direct estimation by experts who
use weights to express the relative importance of analyzed components [3,14,16,
20,21,23], and (2) computation of weights from conditions that are specified by
decision makers [2,10,12,13,15,22,24]. Some authors use conditions to compute
weights using optimization methods (e.g. linear or nonlinear programming) [15].
Others derive “surrogate weights” from ranking [1,18]. One of the early papers [5]
showed that the levels of satisfaction of individual requirements are positively
correlated and this globally reduces the effects of errors in weight estimation.

In system evaluation practice based on the LSP method, we assume that the
number of inputs n is rather large (for complex systems usually greater than 100), and
that evaluation includes professional evaluators who are able to provide expert
knowledge beyond simple ranking of relative importance. If the number of analyzed
components in a group is not too big, then in addition to suggesting the ranking of
components, experts can also quantify ranking relations by expressing the level of
importance (weight). The LSP criteria combine inputs in small groups (2 to 5 inputs),
and this facilitates direct assessment of weights. In addition, for small numbers of
inputs mathematical programming techniques are usually not necessary.

In this paper we investigate the errors made in the process of defining elementary
criteria, and in direct assessment of weights and andness by expert evaluators. Our
goal is to determine origins and ranges of possible errors. These results can serve as
inputs for a future analysis of the reliability of evaluation based on complex LSP
criteria.



2 Parameters of LSP Criterion Functions 



The Logic Scoring of Preference (LSP) method for evaluation of complex systems 
was introduced in 1973 [7], based on the concept of andness (or conjunction degree) 
proposed in [6] and expanded in [8]. A rather detailed presentation of the LSP 
method can be found in [9,11,12,23]. 

LSP criteria are logical structures, and the whole system evaluation process is
embedded in its natural environment, which is continuous logic [9]. A set of
elementary criteria is used for computing elementary preferences
$E_i = g_i(x_i)$, $i = 1, \ldots, n$. All preferences are continuous logic variables ($E_i$ is the
degree of truth in the statement that the performance variable $x_i$ completely satisfies
the corresponding $i$-th requirement): $0 \le E_i \le 1$, $i \in \{1, \ldots, n\}$. The elementary preferences
are aggregated using a stepwise aggregation process that applies a spectrum of logic
functions that are obtained using negation ($x \mapsto 1 - x$) and a partial
conjunction/disjunction function (PCD, with special cases partial conjunction (andor)
and partial disjunction (orand)). The PCD function, denoted by the symbol $\Diamond$, is based on
weighted power means [6,8]:

$$E_1 \Diamond E_2 \Diamond \cdots \Diamond E_k = \left( \sum_{i=1}^{k} W_i E_i^{\rho(c,k)} \right)^{1/\rho(c,k)}, \quad 0 \le c \le 1, \quad -\infty \le \rho(c,k) \le +\infty, \quad \sum_{i=1}^{k} W_i = 1$$

This function uses the andness $c$ (or the orness $d = 1 - c$) to compute the weighted
power mean exponent $r = \rho(c, k)$. The computation of the exponent $r$ is as follows:



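$$\mu(r,k) = \overline{E_1 \Diamond E_2 \Diamond \cdots \Diamond E_k} = \int_0^1 \!\! \int_0^1 \!\! \cdots \!\! \int_0^1 \left( \frac{E_1^r + E_2^r + \cdots + E_k^r}{k} \right)^{1/r} dE_1\, dE_2 \cdots dE_k$$

$$\mu(-\infty,k) = \overline{(E_1 \wedge E_2 \wedge \cdots \wedge E_k)} = \int_0^1 \!\! \int_0^1 \!\! \cdots \!\! \int_0^1 (x_1 \wedge x_2 \wedge \cdots \wedge x_k)\, dx_1\, dx_2 \cdots dx_k = \frac{1}{k+1}$$

$$\mu(+\infty,k) = \overline{(E_1 \vee E_2 \vee \cdots \vee E_k)} = \int_0^1 \!\! \int_0^1 \!\! \cdots \!\! \int_0^1 (x_1 \vee x_2 \vee \cdots \vee x_k)\, dx_1\, dx_2 \cdots dx_k = \frac{k}{k+1}$$

From $\mu(r,k) = [k - c(k-1)]/(k+1)$ we can numerically compute the exponent
$r = \rho(c, k)$. Of course, $\rho(1, k) = -\infty$, $\rho(0, k) = +\infty$, and $\rho(0.5, k) = 1$.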
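
The numerical computation of the exponent from the andness can be sketched as follows for the case of k = 2 inputs. This is one possible approach under stated assumptions, not the procedure used by the authors: the mean of the equal-weight power mean over the unit square is approximated by a midpoint rule, and the exponent is found by bisection on a bounded interval. The grid size, the search bounds and the function names are choices made for this sketch.

```python
import math


def power_mean2(x, y, r):
    """Equal-weight power mean of two values in (0, 1]."""
    if abs(r) < 1e-9:
        return math.sqrt(x * y)  # limit case r -> 0: geometric mean
    # Factor out a reference value so the powers never overflow.
    ref = max(x, y) if r > 0 else min(x, y)
    return ref * (0.5 * ((x / ref) ** r + (y / ref) ** r)) ** (1.0 / r)


def mean_power_mean(r, n_grid=60):
    """Midpoint-rule estimate of the mean of the power mean over the unit square."""
    pts = [(i + 0.5) / n_grid for i in range(n_grid)]
    total = sum(power_mean2(x, y, r) for x in pts for y in pts)
    return total / (n_grid * n_grid)


def exponent_from_andness(c, lo=-30.0, hi=30.0, n_grid=60, iters=40):
    """Solve mean(r, 2) = [k - c(k - 1)] / (k + 1) for r, with k = 2."""
    k = 2
    target = (k - c * (k - 1)) / (k + 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The mean grows with r, so a value below the target means r is too small.
        if mean_power_mean(mid, n_grid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r_neutral = exponent_from_andness(0.5)
print(round(r_neutral, 3))  # close to 1: andness 0.5 gives the arithmetic mean
```

Andness values above 0.5 yield exponents below 1 (partial conjunction), and values below 0.5 yield exponents above 1 (partial disjunction). The bounded search interval means that andness values very close to 0 or 1 are not handled by this sketch.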

The result of an LSP criterion is a global preference, obtained by stepwise logic
aggregation of elementary preferences [9]: $E_0 = L(E_1, \ldots, E_n) = L(g_1(x_1), \ldots, g_n(x_n))$.
In system evaluation practice n can be large, and the stepwise aggregation process
generates a tree structure where each node represents a well-defined subsystem of the
evaluated system [9,23]. The most frequently used aggregation operators are partial
conjunction, partial disjunction, neutrality (arithmetic mean), conjunctive partial
absorption [10], disjunctive partial absorption, and their combinations [12].
Therefore, the LSP criterion model is a nonlinear function of many variables that
includes three groups of parameters: (1) parameters of elementary criteria
$g_1(x_1), \ldots, g_n(x_n)$, (2) weights, and (3) andness/orness in preference aggregation
operators.



3 Precision of Elementary Criteria 

Elementary criteria are usually defined in piecewise linear form with the
coordinates of breakpoints as parameters. This may cause a minor imprecision of the
corresponding elementary preference ($\Delta E_i$), yielding $\Delta E_0 = (\partial E_0 / \partial E_i)\, \Delta E_i$. Such
imprecisions can be reduced by increasing the number of breakpoints. For
unnormalized weights, Stewart [22] showed that additive models could tolerate minor
imprecisions in inputs (up to 10%), caused by errors in elementary criteria.
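
A piecewise linear elementary criterion with breakpoints as parameters can be sketched in Python as follows. The breakpoints (memory capacity mapped to preference) are hypothetical, and holding the preference constant outside the first and last breakpoint is an assumption of this sketch.

```python
def elementary_criterion(breakpoints):
    """Build g(x) from (x, preference) breakpoints sorted by increasing x."""
    xs = [p[0] for p in breakpoints]
    es = [p[1] for p in breakpoints]

    def g(x):
        # Outside the breakpoint range the preference stays constant.
        if x <= xs[0]:
            return es[0]
        if x >= xs[-1]:
            return es[-1]
        # Inside the range, interpolate linearly between neighbouring breakpoints.
        for (x0, e0), (x1, e1) in zip(breakpoints, breakpoints[1:]):
            if x0 <= x <= x1:
                return e0 + (e1 - e0) * (x - x0) / (x1 - x0)

    return g


# Hypothetical criterion: memory capacity mapped to a preference in [0, 1].
memory_pref = elementary_criterion([(4, 0.0), (8, 0.5), (32, 1.0)])

print(memory_pref(6), memory_pref(20))  # 0.25 and 0.75
```

Adding breakpoints refines the shape of the criterion, which corresponds to the reduction of imprecision mentioned above.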

Two other sources of errors related to elementary criteria may be the omission of a 
criterion and redundancy of criteria. If a criterion is relevant, it is not likely that it can 
be overlooked by professional evaluators. The uncontrolled redundancy of 
elementary criteria is a pitfall that can be easily detected and avoided. 



4 The Problem of Modulo 5 Rounding in Weight Distribution 

Quantitative evaluation criteria frequently appear in publications [14,16,20,21,23]. 
These models regularly use percent weights (integer values from 1% to 99%). If we
analyze the distribution of the last digit of weights in these criteria for typical linear 
scoring [14,16] and LSP [23] models we get the results shown in Table 1. 



Table 1. Distribution of the last digit of percent weight

Last digit     0      1    2     3     4     5      6    7     8     9
Lin. [14] %    61.7   0    1.3   1.3   0     31.8   2    1.3   0.6   0
LSP [23] %     88.3   0    0     0     1.6   9.4    0    0     0.7   0
Lin. [16] %    53     0    0     0     0     47     0    0     0     0




These results indicate that evaluators predominantly use values that are rounded to the
closest modulo 5 value (more precisely, the condition is: weight modulo 5 = 0). It is
not difficult to find criteria where all main components are rounded modulo 10.
Consequently, the weight rounding errors can be substantial. The weight of 50% in
modulo 10 rounding represents all values between 45% and 55%; in the case of
modulo 5 rounding, it represents the range from 47.5% to 52.5%. Obviously, absolute
differences of several percent can easily create significant relative errors.
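
The rounding effect can be made concrete with a short Python sketch. The list of percent weights is hypothetical; the helper names are introduced only for this illustration.

```python
def rounding_share(weights, modulus):
    """Fraction of integer percent weights that are multiples of `modulus`."""
    return sum(1 for w in weights if w % modulus == 0) / len(weights)


def represented_range(weight, modulus):
    """Interval of exact values that a weight rounded to `modulus` may stand for."""
    half = modulus / 2
    return weight - half, weight + half


def max_relative_error(weight, modulus):
    """Largest rounding error relative to the stated weight."""
    return (modulus / 2) / weight


# Hypothetical percent weights taken from an evaluation criterion.
weights = [50, 30, 20, 35, 15, 40, 10, 25, 12, 63]

print(rounding_share(weights, 5), rounding_share(weights, 10))  # 0.8 and 0.5
print(represented_range(50, 10), represented_range(50, 5))
print(max_relative_error(10, 10))  # 0.5: half of a 10% weight
```

The last line shows why small weights are the most sensitive: the same absolute rounding interval corresponds to a much larger relative error.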

To investigate this problem we performed three experiments in absolute judgment.
Two experiments used geometrical patterns (segmented lines or pie charts) and one
experiment used a simplified criterion for PC evaluation. The first experiment
included 4 lines divided into 2, 3, 4, and 5 unequal segments. This experiment is
similar to one of the experiments reported in Miller's classical paper [17], and verifies
Miller's observation that in this area it is possible to achieve high accuracy of results.
The second experiment was similar, but included 4 pie charts dividing 100% into 2, 3,
4 and 5 unequal parts. Participants were asked to assess the values of each of 14
linear and 14 circular segments as accurately as possible, using only observation. In
these experiments we can assume that evaluators (computer science students) have
sufficient expertise for the given evaluation.

The third experiment was based on a micro model for selecting a PC for a typical
computer science student using only the following components for evaluation: (1)
Central unit (main memory, processor speed, and disk memory), and (2) Peripherals
(VDU, printer, and audio equipment). Participants were asked to estimate 9 values:
relative weights for 3 components of the central unit, 3 components of peripherals,
relative weights of the central unit versus peripherals, and the level of andness between
the central unit and peripherals.

The precision of individual assessments made by evaluators depends on two
components: (1) skill (based on professional training and experience) and (2) effort
(based on the available time and the evaluator's motivation). The skill and effort of
professional evaluators can be simply classified in three categories: A=average,
H=high (above the average), and L=low (below the average). This gives each
evaluator an “SE (skill/effort) rating” in one of the following nine categories: SE ∈
{HH, HA, HL, AH, AA, AL, LH, LA, LL}. The interpretation of experimental results
should be based on classifying experiments according to the SE level. In all our
experiments the effort was limited by the available time and it is estimated to be low.
We estimate that our experiments with linear and circular geometric segments reflect
the AL level, and the experiments with the PC evaluation micro model reflect the LL
level.

The granularity of weight assessment can be used as an indicator of the precision
of evaluation models. It may also reflect the quality of evaluators (SE rating). Of
course, the quality of evaluators also depends on other factors such as the number of
elementary criteria, granularity of andness and other logic relationships, etc.
Generally, higher granularity yields less error and more reliable evaluation results.

Weight distributions generated by these experiments are shown in Fig. 1. The first 
two experiments with geometrie patterns showed predominant modulo 5 rounding 
(approximately 70% of weights). This indieates that evaluators feel eomfortably with 
20 levels of relative importanee. Many of evaluators were able to generate results that 
are more preeise. This is visible in Fig. 1 as a uniform low-level distribution of 
weights that are not modulo 5 rounded. 

In the case of the PC evaluation micro model, 73% of weights were rounded modulo
10, and 94% of weights were rounded modulo 5. In other words, in 21% of cases
evaluators were able to differentiate up to 20 importance levels, and only in 6% of
cases more than that. This shows that the majority of evaluators were not sufficiently
prepared for this evaluation task, yielding the LL rating.

Fig. 1. Distribution of estimated weights for 1316 weights of linear segments, 1526 weights of
pie-chart segments, and 576 weights of components of the PC evaluation micro model
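
The rounding statistics discussed above can be computed from a list of assessed weights with a short sketch like the following. It assumes that weights are recorded as whole percentages on a 0-100 scale; the function names and the sample list are hypothetical and are not taken from the experiments.

```python
def rounding_shares(weights):
    """Return (share rounded modulo 10, share rounded modulo 5) for integer percent weights."""
    n = len(weights)
    mod10 = sum(1 for w in weights if w % 10 == 0) / n
    mod5 = sum(1 for w in weights if w % 5 == 0) / n
    return mod10, mod5


def implied_levels(step):
    """Number of importance levels implied by rounding to multiples of `step` percent."""
    return 100 // step


# Hypothetical sample of assessed weights (percent), for illustration only.
sample = [10, 25, 30, 35, 40, 17, 20, 23]
mod10, mod5 = rounding_shares(sample)
# Shares: multiples of 10, multiples of 5 but not of 10, finer than 5.
print(mod10, mod5 - mod10, 1 - mod5)
print(implied_levels(10), implied_levels(5))
```

The difference between the two shares corresponds to the group that differentiates up to 20 levels, and the remainder to the group that works at a finer granularity.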

Miller [17] reported that there are between 10 and 15 distinct positions along a
linear segment that can be recognized by evaluators. Our results in subsequent
sections show that in the case of assessment of weights it is realistic to expect
substantially higher accuracy. The limits on our capacity for processing information
are clearly visible in the area of systems evaluation. Fortunately, professionally
prepared evaluators normally attain an accuracy that is two times above the 10-15
positions level.

5 Accuracy of Weight Assessment



It is useful to classify all experimental results of weight assessment into four groups: 

• Outliers, that contain values that are far off the exact values. 

• Data that violate the ranking of exact values. 

• Data that have correct ranking but may contain outliers. 

• Correct data, that are without outliers and consistent with exact ranking. 

Since the exact values in our experiments were selected to be sufficiently different
from each other, we did not expect ranking errors. However, 32% of evaluators
made ranking errors in the linear pattern case, and 28% made ranking errors in the
circular pattern case. These errors can be attributed to insufficient effort caused by
short assessment time. A summary of results obtained for these four groups of data is
shown in Table 2. It is reasonable to expect that professional evaluators put sufficient
effort into their work and that they belong to the group with exact ranking and without
outliers.



Table 2. Summary of weight assessment experiments for a group of evaluators

Line Segments [L]                        All data     All data        Data w/o outliers   Data w/o outliers
Pie Charts [P]                           unfiltered   rank filtered   unfiltered          rank filtered
-----------------------------------------------------------------------------------------------------------
Percent of modulo 5 rounded [L]          72.26        68.21           71.27               68.23
Mean range in group estimates [L]        17.93        13.29           15.29               12.64
Mean abs error of group estimates [L]     0.52         0.79            0.45                0.66
Mean abs error of evaluator [L]           2.71         2.48            2.53                2.37
Percent of modulo 5 rounded [P]          63.50        56.87           61.78               55.12
Mean range in group estimates [P]        20.00        14.71           15.50               12.71
Mean abs error of group estimates [P]     0.64         0.76            0.52                0.69
Mean abs error of evaluator [P]           2.56         2.32            2.30                2.18

In both experiments the group assessment (based on mean values) is remarkably
accurate. The absolute difference (the group estimate minus the exact value) is
below 1% in every case. Individual evaluators never achieve the accuracy of group
estimates, but the mean absolute error of the whole population that provides correct
ranking is always below 2.5%.

The quality of individual evaluators is shown in Fig. 2. The points represent the
cumulative probability distribution of average absolute errors for linear and pie-chart
segments for the population of rank-qualified evaluators. Professional evaluators may
be selected as the best 10% of this population. For them the average absolute error is
below 1.5%. In addition to average absolute errors, Fig. 2 also shows the number of
distinct positions (denoted DP) that can be recognized by individual evaluators. More
than 60% of the population can recognize 20 distinct positions. Professional evaluators
(top 10%) can recognize more than 33.

The average absolute error of an individual rank-qualified evaluator is almost
independent of the value of the weight, as shown in Fig. 3. The average error is between
2.2 and 2.4, which is substantially below the quality we attribute to professional
evaluators. To estimate the upper limit of possible errors, we can use the
experiment with the PC evaluation micro model because of the lowest rating of
participants' skills and effort. In the case of weights, the average absolute error was
9.33%.

[Figure: percentage of evaluators versus average absolute error]

Fig. 2. The distribution of evaluation errors (rank filtered, without outliers)




Fig. 3. Average absolute error of individual evaluator as a function of weight 



6 The Number of Weights in a Group 

If a preference aggregation block includes n inputs, it is necessary to simultaneously
determine n weights. Evaluators must consider relationships between each pair of
inputs. For 2, 3, 4, 5, 6, 7, ..., n inputs, there are 1, 3, 6, 10, 15, 21, ..., n(n-1)/2 pairs.
Obviously, the complexity of selecting weights quickly increases with a growing
number of inputs. To keep the number of pairs close to the “magical number seven
plus or minus two” [17], we suggest the use of up to 5 inputs (in the case of more
inputs, we keep them in smaller groups, and then aggregate group preferences). This
conclusion is supported by the growth of the average relative error shown in Fig. 4.
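
The pair counts quoted above follow directly from the formula n(n-1)/2. A minimal sketch (the function name is hypothetical):

```python
def pair_count(n):
    """Number of unordered pairs of inputs among n inputs: n(n-1)/2."""
    return n * (n - 1) // 2


# Pair counts for aggregation blocks with 2 to 7 inputs.
for n in range(2, 8):
    print(n, pair_count(n))
```

With 5 inputs the evaluator already has to keep 10 pairwise relationships in mind, which is why larger sets of inputs are split into smaller groups.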




[Figure: relative error versus number of inputs]

Fig. 4. The growth of relative errors for experiments with line and pie chart segments (2 to 5
segments)



7 Selecting Andness, Orness, and Partial Absorption Parameters 

The problem of selecting the conjunction and disjunction degrees (the andness c, or
the orness d=1-c) in PCD and in functions that are more complex can be solved in
two ways:

• Direct assessment from tables with discrete values. 

• Computation based on conditions specified by the evaluator. 

An example of multi-level andness/orness is the system of 17 functions that include 
conjunction (C), arithmetic mean (A), and disjunction (D) proposed in [6,7,8], and 
shown in Fig. 5. 

The increment of andness/orness between various levels of PCD in Fig. 5 is
1/16=0.0625. This indicates that in the best case the absolute error in andness/orness
can be up to 3.1%. The situation is similar for the partial absorption function [10, 12],
which normally has 3 parameters (two weights and andness); its parameters can be
selected from specialized penalty/reward tables [10].
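
The 3.1% figure is half of the 1/16 increment: a desired andness that falls between two adjacent levels is at most 1/32 = 0.03125 away from the nearer one. A minimal sketch of this rounding, assuming andness is given as a number in [0, 1] (the names below are hypothetical):

```python
LEVELS = 17
STEP = 1 / (LEVELS - 1)   # increment between adjacent levels, 1/16


def nearest_level(c):
    """Round an andness value c in [0, 1] to the nearest of the 17 discrete levels."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("andness must be in [0, 1]")
    return round(c / STEP) * STEP


def worst_case_error():
    """Largest possible distance between a desired andness and its nearest level."""
    return STEP / 2


print(STEP, worst_case_error(), nearest_level(0.70))
```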



[Figure: the 17 levels C, C++, C+, C+-, CA, C-+, C-, C--, A, D--, D-, D-+, DA, D+-, D+, D++, D]

Fig. 5. Seventeen levels of andness/orness and their symbols

Similarly to the case of weights, the experiment with the PC evaluation micro
model serves as a good indicator of the upper bound of the andness/orness error. Its
absolute value is 9.49%.

Evaluators frequently feel more confident in giving conditions that parameters or
inputs must satisfy than in directly giving the parameter estimates [12,24]. LSP
criteria can be interpreted as preferential neural networks [12], and all
preference aggregation parameters (andness and weights) can be simultaneously
computed from the desired set of input/output mappings that evaluators specify as a
training set for the neural network. In the case of weight assessment, it is possible to
derive weights from a set of conditions using methods such as linear or nonlinear
programming [15].

In all cases of computation of parameters, the accuracy of resulting parameters 
depends on the reliability of conditions given by evaluators and the quality of the 
computational model. Since the input conditions are selected only when evaluators 
feel that they are more accurate than direct estimates, it is reasonable to expect that 
the resulting accuracy of parameters is higher than in the case of direct estimates. 

8 Summary and Conclusions 

The expressive power of LSP criteria for system evaluation depends on the
evaluator's knowledge of continuous preference logic. The simplest logic operators
are the partial conjunction and the partial disjunction. They have independently
adjustable levels of importance and andness/orness. Compound logic operators, such
as the partial absorption function, are obtained by superposition of simple operators.
Selecting the correct type of operator and adjusting its parameters require professional
preparation of evaluators. Such preparation regularly reduces or eliminates some of
the evaluation errors.

Sources of errors in LSP criteria (in the order of increasing importance) are:

• Omission of elementary criteria. 

• Uncontrolled redundancy of elementary criteria. 

• Elementary criteria breakpoint parameter errors. 

• Errors in weights of preference aggregation functions. 

• Errors in the selected level of andness/orness and/or errors in the structure of
preference aggregation functions.

In the case of system evaluation with professionally prepared evaluators, errors in
the area of elementary criteria should not be a source of serious concern. Such errors
can be sufficiently reduced through careful selection of input performance variables,
and expanding efforts needed to define sufficiently precise elementary criterion
functions.

The LSP method tries to reduce the selection of weights and andness/orness to
Miller's absolute judgments of unidimensional stimuli. According to Miller, for this
kind of judgment, the values of weights and andness should be reliably selected in 10
to 15 distinct positions along the unit interval. We found this result to be a lower
bound of accuracy that can be expected from unmotivated or unprepared evaluators.
Professionally prepared evaluators can attain substantially higher granularities.

The errors in weights can be significant. Some of them are caused by the modulo
5 rounding problem. We found that in all cases of parameter estimation evaluators
can easily differentiate 20 levels of relative importance. This seems to be the average
value for the majority of evaluators working at the average effort level. In the most
difficult situations, the number of levels may be reduced to 10. If professional
evaluators are defined as the best 10% of the general population, then their accuracy in
weight assessment is expected to be above 30 levels. Such accuracy
can also be attained using specialized weight computation tools.

Logic relationships between groups of inputs can be nontrivial, and errors in
selecting the structure of logic aggregation and/or the appropriate levels of
andness/orness are primary sources of errors in LSP models for systems evaluation.
Multi-level schemes of andness/orness can yield absolute errors from 3 to 10%. This
is the area where specialized tools and proper professional preparation of evaluators
are the most needed. The use of specialized tools reduces errors, but increases the
effort necessary for preparing complex criteria.

In the case of system comparison and selection, it is important to know the level of
confidence in the final ranking of competitive systems. The level of confidence can
be computed from simulation models that use expected errors in weights and
andness/orness as input data. The results of this paper are a necessary initial step in
that direction.

Abstract. We analyze the reliability of results obtained using the Logic Scoring
of Preference (LSP) method for evaluation and comparison of complex 
systems. For each pair of competitive systems our goal is to compute the level 
of confidence in system ranking. The confidence is defined as the probability 
that the system ranking remains unchanged regardless of the criterion function 
parameter errors. We propose a simulation technique for the analysis of the 
reliability of ranking. The simulator is based on specific models for selection of 
random weights and random degrees of andness/orness. The proposed method
is illustrated by a real life case study that investigates the reliability of 
evaluation and selection of a mainframe computer system. 



1 Introduction 

All criterion functions for system evaluation reflect opinions of evaluators and can
contain errors. The errors are predominantly caused by incorrect assessment of
criterion function parameters. In the case of LSP criteria [5,6,7], the most significant
sources of errors are errors in weights and andness/orness [8,9]. Since the resulting
ranking of competitive systems has limited accuracy, it is necessary to perform a
reliability analysis to determine the level of confidence in the obtained results.

The reliability of evaluation results is a problem that has been analyzed in the
MCDM literature [2]. Simulation techniques for the analysis of effects of random
weights in system evaluation models that use a fixed level of andness/orness (additive
or multiplicative models) can be found in [1,2,3,4,11]. Stewart [11,2] performed a
detailed analysis of robustness of additive value functions. Buttler et al. [4] propose
several methods for selecting random weights and a simulation analysis of reliability
of additive and multiplicative utility functions. The current literature has been focused
on reliability models that do not include variable degrees of andness/orness. In the
case of the LSP method, however, evaluators adjust both weights and a variety of
logic operators that include variable degrees of andness/orness. Therefore, to analyze
the reliability of LSP criteria we have to simultaneously investigate the effects of
random variations in both weights and the levels of andness/orness. Such a technique
is proposed in this paper.

2 The Structure of an LSP Criterion

The structure of an LSP criterion for system evaluation and comparison is shown in
Fig. 1. Input values $x_i \in \mathbb{R}$, $i = 1,\dots,n$ are performance variables. They include all
relevant system indicators that affect the ability of the evaluated system to satisfy user
requirements. Functions $g_i : \mathbb{R} \to [0,1]$, $i = 1,\dots,n$ are elementary criteria, used to
compute elementary preferences $E_i \in [0,1]$, $i = 1,\dots,n$. Elementary preferences are
normalized values that show the level of satisfaction of individual requirements. In
order to compute the global preference of an evaluated system, elementary
preferences are aggregated using a stepwise procedure. The preference aggregation
blocks are based on weighted power means:

$$e_0 = (W_1 e_1^r + \dots + W_k e_k^r)^{1/r}$$

(here $e_1,\dots,e_k$ denote input preferences, $e_0$ is the output preference of an
aggregation block, and the exponent $r$ can be computed from a desired level of
andness). At the end of the stepwise aggregation process, we compute the global
preference $E \in [0,1]$ that reflects the global satisfaction of all requirements.
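
A single aggregation block of this kind can be sketched as follows. The function name and the sample inputs are hypothetical; the treatment of the limit cases (minimum, maximum and geometric mean) follows the standard properties of weighted power means and is an assumption about how a tool would handle them, not a description of the author's software.

```python
import math


def weighted_power_mean(prefs, weights, r):
    """Aggregate preferences e_1..e_k in [0, 1] with weights W_1..W_k and exponent r."""
    if len(prefs) != len(weights):
        raise ValueError("prefs and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9 or min(weights) <= 0:
        raise ValueError("weights must be positive and sum to 1")
    if r == math.inf:             # pure disjunction: the limit is the maximum
        return max(prefs)
    if r == -math.inf:            # pure conjunction: the limit is the minimum
        return min(prefs)
    if r <= 0 and min(prefs) == 0:
        return 0.0                # an unsatisfied mandatory requirement yields zero
    if r == 0:                    # limit case: weighted geometric mean
        return math.exp(sum(w * math.log(e) for w, e in zip(weights, prefs)))
    return sum(w * e ** r for w, e in zip(weights, prefs)) ** (1.0 / r)


# Hypothetical two-input block, for illustration only.
inputs, w = [0.4, 0.8], [0.5, 0.5]
for r in (-math.inf, -0.72, 1.0, 3.929, math.inf):
    print(r, weighted_power_mean(inputs, w, r))
```

The output preference grows with the exponent, moving from the smaller input (conjunction) through the arithmetic mean at r = 1 to the larger input (disjunction).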

Elementary criterion functions are frequently based on piecewise linear
approximations, and breakpoint coordinates are among the parameters of the LSP
criterion function. Fortunately, these parameters can be made rather accurate [8] and
consequently the evaluation errors are caused primarily by errors in the following two
arrays of parameters: $W_1,\dots,W_m$ (weights), and $r_1,\dots,r_k$ (exponents).




Fig. 1. A general structure of an LSP criterion with parameters W and r 



Suppose that we have two competitive systems, A and B. In the case of system A
the inputs $x_{a1}, x_{a2}, \dots, x_{an}$ generate the global preference $E_a$, and for system B the
inputs $x_{b1}, x_{b2}, \dots, x_{bn}$ generate the global preference $E_b$. If $E_a > E_b$, then system A
is better than system B, and this is denoted $A \succ B$. If the parameters W and r of the
criterion function are not reliable, the ranking $A \succ B$ might be questionable. Thus,
we are interested in determining the level of confidence in the resulting ranking.

3 The Concept of Confidence Level

Evaluation models are subjective in the sense that they reflect expert knowledge of
evaluators. Evaluators assess the values of parameters W and r. Therefore, the correct
values of parameters would be those obtained as mean values for a large population
of qualified evaluators. In practical evaluations, however, evaluation teams are either
small or consist of only one evaluator. Consequently, the estimated parameters
differ from the unknown optimum values and the resulting global preferences
$E_a$ and $E_b$ differ from the unknown optimum values $E_a^*$ and $E_b^*$:
$|E_a - E_a^*| = \varepsilon_a$, $|E_b - E_b^*| = \varepsilon_b$, $0 < \varepsilon_a < 1$, $0 < \varepsilon_b < 1$.
If $(E_a - E_b)(E_a^* - E_b^*) > 0$, the ranking is
correct. Of course, we would like to have good estimates of the errors $\varepsilon_a$ and $\varepsilon_b$, and
an estimate of the probability of correct ranking. This probability is denoted $\Psi$ and
called the confidence level. A similar indicator has been used in [1,10].

Assuming that $E_a > E_b$, the maximum value of $\Psi$ is 1 (or 100%), indicating
complete confidence in $A \succ B$. The minimum value of $\Psi$ is 0, indicating that $B \succ A$
and consequently it is impossible that $A \succ B$. If $E_a = E_b$ there is equal probability
that $A \succ B$ and $A \prec B$. In this case, $\Psi = 0.5$.

The confidence level depends on the accuracy attainable by evaluators who assess
the parameters of the criterion function. We can assume that each individual evaluator
assesses all parameters, and a population of evaluators generates a distribution of
each component of the parameter arrays. A typical weight distribution for a large
population of evaluators is shown in Fig. 2. The mean value of this distribution is $\bar{w}$
and it can be interpreted as the optimum value.

In practical evaluations the optimum values of parameters (such as $\bar{w}$) and the
optimum preferences $E_a^*$ and $E_b^*$ remain unknown. The only information that we
have includes the assessed value $w$, and its expected variation range $\rho$
that contains the accurate value $\bar{w}$ (an experimental analysis of $\rho$ is presented in [8]).




Fig. 2. A typical distribution of weights for the population of evaluators

Fig. 3. Characteristic cases of relationships between global preferences

Suppose that we can generate a large number of random parameter arrays r and W.
The values should be taken from their respective variation ranges. A safe
(pessimistic) approach is to use the uniform distribution, as shown in Fig. 2. In such a
case, one of the generated points will be equal or very close to the optimum parameters.
For each pair of such arrays we compute $E_a$ and $E_b$, and plot a corresponding point
in the $E_a, E_b$ square, as shown in Fig. 3. The shape of the resulting distribution of global
preferences is shown in Fig. 3a. Figs. 3b,c,d show characteristic cases with a finite
number of sample points. Let $N_a$ be the number of points where $E_a > E_b$, and let $N_b$
be the number of points where $E_a < E_b$. The total number of points is $N_a + N_b = N$.

Using the probability density function $p(E_a, E_b)$ shown in Fig. 3a, the confidence
level $\Psi$ can be expressed as the probability that $A \succ B$:

$$\Psi(A \succ B) = \int_{E_{a\,min}}^{E_{a\,max}} dE_a \int_{E_{b\,min}}^{E_a} p(E_a, E_b)\, dE_b$$

Using the simulated sample, the confidence level can be approximated as follows:

$$\Psi(A \succ B) = \frac{N_a}{N}, \quad 0 \le \Psi(A \succ B) \le 1$$

Both the probability density function $p(E_a, E_b)$ and the confidence level $\Psi$
depend on the variability of parameter vectors r and W expressed as their range $\rho$.
Therefore, the confidence level is a function of range as shown in Fig. 4.
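
The sample-based approximation can be sketched as a small Monte Carlo loop. The criterion used here is a deliberately tiny stand-in (a weighted arithmetic mean of two made-up elementary preferences with one random weight), not an LSP criterion; all names and numbers are hypothetical.

```python
import random


def estimate_confidence(sample_pair, n_samples, rng):
    """Approximate Psi(A > B) as N_a / N from simulated pairs (E_a, E_b)."""
    n_a = n_b = 0
    for _ in range(n_samples):
        e_a, e_b = sample_pair(rng)
        if e_a > e_b:
            n_a += 1
        elif e_a < e_b:
            n_b += 1
    n = n_a + n_b
    return n_a / n if n else 0.5   # only ties: either ranking is equally likely


def toy_pair(rng, W1=0.5, sigma=0.2):
    """Hypothetical criterion: arithmetic mean of two inputs with a random weight."""
    w1 = W1 + (2 * rng.random() - 1) * sigma
    x_a, x_b = (0.9, 0.5), (0.6, 0.7)   # made-up elementary preferences
    e_a = w1 * x_a[0] + (1 - w1) * x_a[1]
    e_b = w1 * x_b[0] + (1 - w1) * x_b[1]
    return e_a, e_b


print(estimate_confidence(toy_pair, 20000, random.Random(1)))
```

In this toy case system A wins exactly when the random weight exceeds 0.4, so with the weight uniform on [0.3, 0.7] the estimate should be close to 0.75. Repeating the estimate for several values of `sigma` gives a curve of confidence versus range of variation.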




Fig. 4. Confidence $\Psi$ as a function of the range of variation

The characteristic values from this curve are:

$\rho_1$ = the safe range, defined as the largest variation range that still produces
the maximum confidence $\Psi = 1$

$\Psi_{min}$ = the minimum acceptable confidence level (e.g. 90%) that defines the
maximum acceptable range of variation.

$\rho_{max}$ = the maximum acceptable range (upper limit of the parameter
estimation error)

In order to compute $\Psi(\rho)$ we need a simulator based on appropriate generators of
random weights and random andness/orness levels. These generators must reflect the
techniques used by experts to select the parameters of LSP criteria.

4 Generation of Random andness/orness

There are two ways evaluators select parameters of weighted power means. The first 
is to select a desired level of andness/orness and use it to determine the corresponding 
exponent. The other way is to use preferential neural networks [7] or other methods 
for simultaneously computing both weights and exponents from desired input/output 
conditions. In the first case, we use a discrete set of 17 exponents shown in Table 1. 
The second approach yields exponents from a continuous interval. 



Table 1. Selection of Partial Conjunction/Disjunction

Type of polarization         Level of polarization   Symbol   Orness d   Andness c   Exponent r
-----------------------------------------------------------------------------------------------
Disjunctive polarization     Strongest               D        1.000      0           +∞
(partial disjunction)        Very Strong             D++      0.9375     0.0625      20.63
                             Strong                  D+       0.8750     0.1250      9.521
                             Medium Strong           D+-      0.8125     0.1875      5.802
                             Medium                  DA       0.7500     0.2500      3.929
                             Medium Weak             D-+      0.6875     0.3125      2.792
                             Weak                    D-       0.6250     0.3750      2.018
                             Very Weak               D--      0.5625     0.4375      1.449
Neutrality                                           A        0.5000     0.5000      1
Conjunctive polarization     Very Weak               C--      0.4375     0.5625      0.619
(partial conjunction),       Weak                    C-       0.3750     0.6250      0.261
nonmandatory
Conjunctive polarization     Medium Weak             C-+      0.3125     0.6875      -0.148
(partial conjunction),       Medium                  CA       0.2500     0.7500      -0.72
mandatory requirements       Medium Strong           C+-      0.1875     0.8125      -1.655
                             Strong                  C+       0.1250     0.8750      -3.510
                             Very Strong             C++      0.0625     0.9375      -9.06
                             Strongest               C        0          1.000       -∞



The method of selecting andness/orness from Table 1 consists of first deciding
whether the polarization of an aggregation block is conjunctive (to model the
simultaneity of requirements) or disjunctive (to model the replaceability of
requirements). The neutrality function (arithmetic mean) is used in cases where there is no
clear dominance of conjunctive or disjunctive polarization. Selecting between
conjunctive and disjunctive polarization is regularly done without errors.

The most frequent polarization is conjunctive. In this case, the evaluators must
decide whether the requirements are mandatory or nonmandatory. This is also an easy
decision, because the mandatory requirements are those whose nonsatisfaction must
yield the zero resulting preference. For example, no evaluator would accept a computer
without memory, and therefore the sufficient memory capacity represents a
mandatory requirement for which the exponent r must be less than or equal to zero.
The separation of mandatory and nonmandatory requirements is also done without
errors. According to Table 1, we can identify the following 4 ranges:

(1) Partial disjunction: D--, D-, D-+, DA, D+-, D+, D++, D

(2) Neutrality function: A

(3) Nonmandatory partial conjunction: C--, C-

(4) Mandatory partial conjunction: C-+, CA, C+-, C+, C++, C.

According to the analysis made in [8] the range of variation of andness/orness is
usually from 6% to 12.5%. Since random errors do not cross the safe mandatory
requirement border, it is reasonable to make a discrete random andness/orness
generator based on the operator variation range specified in Table 2.



Table 2. Discrete random variation ranges for 17 degrees of andness/orness

Operator   Var. range       Operator   Var. range       Operator   Var. range
-----------------------------------------------------------------------------
D          D++, D           D-         D--, D-, D-+     C-+        C-+, CA
D++        D+, D++, D       D--        A, D--, D-       CA         C-+, CA, C+-
D+         D+-, D+, D++     A          D--, A, C--      C+-        CA, C+-, C+
D+-        DA, D+-, D+      C--        A, C--, C-       C+         C+-, C+, C++
DA         D-+, DA, D+-     C-         C--, C-          C++        C+, C++, C
D-+        D-, D-+, DA                                  C          C++, C
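
The variation ranges of Table 2 follow a simple rule: each operator may move to an adjacent level, except across the border between nonmandatory (C-) and mandatory (C-+) partial conjunction. A minimal sketch of a discrete generator built on that rule is shown below; the function names are hypothetical, and choosing uniformly within the range is an assumption of this sketch.

```python
import random

# The 17 operators of Table 1, ordered from pure disjunction to pure conjunction.
OPERATORS = ["D", "D++", "D+", "D+-", "DA", "D-+", "D-", "D--", "A",
             "C--", "C-", "C-+", "CA", "C+-", "C+", "C++", "C"]
FIRST_MANDATORY = OPERATORS.index("C-+")


def variation_range(op):
    """The operator and its adjacent levels, never crossing the mandatory border."""
    i = OPERATORS.index(op)
    mandatory = i >= FIRST_MANDATORY
    return [OPERATORS[j] for j in (i - 1, i, i + 1)
            if 0 <= j < len(OPERATORS) and (j >= FIRST_MANDATORY) == mandatory]


def random_operator(op, rng):
    """Pick one operator from the variation range (uniform choice is an assumption)."""
    return rng.choice(variation_range(op))


rng = random.Random(0)   # fixed seed for repeatability
print(variation_range("C-"), variation_range("A"), random_operator("CA", rng))
```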



The continuous approach uses the same ranges as the discrete case. If the range of
orness is [p,q] then the random orness is generated as d = p + (q-p)*urn(), where
urn() denotes a standard uniform random number generator. Then we use the
random value d to compute the corresponding random exponent.
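
The continuous generator can be sketched as follows. The text does not give the formula that converts orness to an exponent, so this sketch interpolates linearly between the finite rows of Table 1; that interpolation is an assumption made here for illustration and is not necessarily the conversion used by the author. Function names are hypothetical.

```python
import random
from bisect import bisect_right

# Finite rows of Table 1, ordered by increasing orness d = 1/16, 2/16, ..., 15/16.
ORNESS = [k / 16 for k in range(1, 16)]
EXPONENT = [-9.06, -3.510, -1.655, -0.72, -0.148, 0.261, 0.619, 1.0,
            1.449, 2.018, 2.792, 3.929, 5.802, 9.521, 20.63]


def random_orness(p, q, rng):
    """Random orness in [p, q]: d = p + (q - p) * urn()."""
    return p + (q - p) * rng.random()


def exponent_from_orness(d):
    """Exponent for orness d by linear interpolation of Table 1 (an assumption)."""
    if not ORNESS[0] <= d <= ORNESS[-1]:
        raise ValueError("orness outside the finite range of Table 1")
    k = min(bisect_right(ORNESS, d), len(ORNESS) - 1)
    d0, d1 = ORNESS[k - 1], ORNESS[k]
    r0, r1 = EXPONENT[k - 1], EXPONENT[k]
    return r0 + (r1 - r0) * (d - d0) / (d1 - d0)


rng = random.Random(3)                   # fixed seed for repeatability
d = random_orness(0.6875, 0.8125, rng)   # range of DA: from D-+ to D+-
print(d, exponent_from_orness(d))
```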



5 Generation of Normalized Random Weights 

In the case of two inputs, there are two weights: $W_1$ and $W_2$. We assume $0 < W_1 < 1$,
$0 < W_2 < 1$, and $W_1 + W_2 = 1$. In the case of three weights the equivalent conditions are
$0 < W_1 < 1$, $0 < W_2 < 1$, $0 < W_3 < 1$ and $W_1 + W_2 + W_3 = 1$. Fig. 5 shows the cases of two
and three weights, where gray areas denote the range of variation of individual
weights.

Suppose that the width of all gray areas is always $\rho = 2\sigma$ and that the smallest
assessed weight is greater than $2\sigma$. Then, in the case of two variables, the random
weights $w_1$ and $w_2$ can be computed using the assessed weights $W_1$ and $W_2$ as
follows: $w_1 = W_1 + (2\,\mathrm{urn}() - 1)\,\sigma$; $w_2 = 1 - w_1$. In the case of three inputs, we
first generate the independent uniformly distributed random deviations of borders
between adjacent weights:

$$r_{12} = (2\,\mathrm{urn}() - 1)\,\sigma; \quad -\sigma < r_{12} < +\sigma$$
$$r_{23} = (2\,\mathrm{urn}() - 1)\,\sigma; \quad -\sigma < r_{23} < +\sigma$$
$$r_{31} = (2\,\mathrm{urn}() - 1)\,\sigma; \quad -\sigma < r_{31} < +\sigma$$

Then we compute the random weights $w_1$, $w_2$ and $w_3$ by complementary adding
and subtracting the random deviations (in this process we reject cases where the
obtained random weights are less than the minimum weight threshold, denoted
$W_{min}$):

$$w_1 = W_1 + r_{12} - r_{31}; \quad w_1 \ge W_{min} > 0$$
$$w_2 = W_2 + r_{23} - r_{12}; \quad w_2 \ge W_{min} > 0$$
$$w_3 = W_3 + r_{31} - r_{23}; \quad w_3 \ge W_{min} > 0, \quad w_1 + w_2 + w_3 = W_1 + W_2 + W_3 = 1$$

Fig. 5. Weights in the case of 2 and 3 inputs

A reasonable value of the threshold is a few percent (e.g. $W_{min} = 0.04$). This
approach can be easily generalized for the case of n weights.
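
The generators for two and three weights can be sketched as follows. The function names, the retry limit `max_tries` and the sample values are hypothetical additions for illustration; the formulas are the ones given above.

```python
import random


def deviation(sigma, rng):
    """Uniform random deviation (2*urn() - 1)*sigma."""
    return (2 * rng.random() - 1) * sigma


def random_weights_2(W1, sigma, rng):
    """Two inputs: perturb W1 and let w2 absorb the change, so w1 + w2 = 1."""
    w1 = W1 + deviation(sigma, rng)
    return w1, 1 - w1


def random_weights_3(W, sigma, w_min, rng, max_tries=1000):
    """Three inputs: shift the borders between adjacent weights, reject small weights."""
    W1, W2, W3 = W
    for _ in range(max_tries):
        r12, r23, r31 = (deviation(sigma, rng) for _ in range(3))
        w = (W1 + r12 - r31, W2 + r23 - r12, W3 + r31 - r23)
        if min(w) >= w_min:
            return w
    raise RuntimeError("no admissible weight triple found")


rng = random.Random(0)   # fixed seed for repeatability
print(random_weights_2(0.6, 0.025, rng))
print(random_weights_3((0.5, 0.3, 0.2), 0.025, 0.04, rng))
```

Because every border deviation is added to one weight and subtracted from its neighbour, the three random weights always sum to 1 without a separate normalization step.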

Any method for selecting random weights should reflect the way in which 
evaluators behave. For example, if evaluators firmly establish the ranking of weights 
then the above method should also reject those sets of random weights that do not 
preserve the desired ranking. 

Independent generation of random weights followed by their normalization is also
possible: $w_i = W_i + (2\,\mathrm{urn}() - 1)\,\sigma$; $i = 1,\dots,n$, followed by $w_i := w_i/(w_1 + \dots + w_n)$.

Despite its popularity in the literature, we avoid this approach because it does not
generate a uniform distribution of points in the gray area of Fig. 5.
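
For comparison, the normalization-based alternative can be sketched as below. The function name is hypothetical, and the sketch assumes that sigma is smaller than the smallest assessed weight so that all perturbed values stay positive.

```python
import random


def normalized_random_weights(W, sigma, rng):
    """Perturb each weight independently, then rescale so the weights sum to 1."""
    v = [Wi + (2 * rng.random() - 1) * sigma for Wi in W]
    total = sum(v)
    return [vi / total for vi in v]


rng = random.Random(0)   # fixed seed for repeatability
print(normalized_random_weights([0.5, 0.3, 0.2], 0.025, rng))
```

Here the final value of each weight depends on the deviations drawn for all the other weights, because the rescaling divides by their sum.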



6 Case Study 

The proposed analysis of reliability is applied to a real mainframe computer selection
project performed for a steel industry company. It has 94 inputs, 56 aggregation
operators, with 60 exponents and 150 weights. There were 6 competitive systems
(A,B,C,D,E,F) that attained global preferences A: 74.7%, B: 71.7%, C: 67.6%, D:
67.0%, E: 29.2%, and F: 27.1%. If we focus on the leading two systems (A and B),
the main question is whether the difference of 3% is significant enough to claim that $A \succ B$.

According to the analysis of errors of weights and andness presented in [8], in the
case of professional evaluation we can expect variations of weights to be in the range
$2\sigma = 5\%$, and the variations of andness/orness to be in the range of 12.5%. Using a
uniform distribution of weights and andness/orness, we generated results that are
summarized in Figs. 6a-f. Figs. 6a,b,c correspond to the continuous model of variation of
andness/orness. Figs. 6d,e,f correspond to cases where the variation of andness/orness
follows a discrete pattern of variation. Figs. 6a-e are based on a sample of 500000
points, and Fig. 6f is based on a sample of 2000 points.

Fig. 6. Reliability of ranking for two mainframe computer systems

The distributions of global preferences of the two leading systems in the case of
simultaneous continuous variation of weights and andness/orness are shown in Fig.
6a. The distributions of systems A and B clearly overlap, but the ranking remains
stable, as shown in Fig. 6b where the confidence level is 99.7%. This indicates that
system A rather uniformly dominates system B. Fig. 6c shows the confidence level as
a decreasing function of the range of variation of andness/orness. The presented
curves correspond to two cases: (1) variations only in andness/orness and (2)
variations in both andness/orness and weights. Their difference is insignificant. The
safe range is 12%, and even if the range comes to 25%, the confidence in ranking
remains at the level of 90%. This shows the robustness of ranking for the leading
systems.

The results shown in Figs. 6d and 6e are similar to the case of continuous variation
(Figs. 6a and 6b). In the discrete case, the level of confidence is insignificantly
reduced to 98.4%. Results in Figs. 6a-e correspond to the sample containing 500,000
simulated cases. In the case of a smaller sample containing 2000 simulated points, the
points in the global preference plane are shown in Fig. 6f. All the presented results
show that, despite the small advantage of only 3%, system A outperforms
system B at a high level of confidence. Thus, the difference of 3% is significant for
selecting system A as the best alternative.

7 Conclusions 

The difference of global preference between competitive systems is not a direct
indicator of the reliability of evaluation. Small differences do not automatically imply
low reliability of ranking. In all system evaluation and selection projects, it is
necessary to compute the confidence in ranking of competitive systems, and
simulation is an effective approach to solve this problem.

If one system consistently dominates another system in the most important inputs,
then the ranking remains stable regardless of parameter errors. In such a case, the
sensitivity of ranking with respect to parameter variations is low, and the confidence
in ranking is high. In an opposite case, competitive systems can more significantly
differ, and still provide a modest confidence in ranking.

Our approach to the computation of confidence level has the following properties:
(1) we use specialized generators of random weights and random andness/orness (and
exponents) that directly simulate the behavior of expert evaluators, (2) the
variation ranges of parameters are experimentally verified, and (3) by using uniform
distributions instead of normal ones we add an extra safety margin to our results. The
application of the proposed confidence analysis method to a real life computer
selection project generated a high confidence level and verified the resulting ranking.
Therefore, we recommend including the analysis of confidence in all system
evaluation models. In addition, a confidence analyzer must be a part of system
evaluation tools.

Abstract. In the framework of aggregation by the discrete Choquet
integral, an unsupervised method for the identification of the underlying
capacity was recently proposed by the author in [1]. In this paper, an
example of the application of the proposed methodology is given: in the
absence of initial preferences, the approach is applied to the evaluation
of students.



1 Introduction 

Consider a finite set of objects $O := \{o_1, \dots, o_n\}$ described by m cardinal
attributes $A_1, \dots, A_m$ defined on interval scales. Each object $o \in O$ is identified
with its profile $(a_1^o, \dots, a_m^o) \in \mathbb{R}^m$ where, for any $i \in \{1, \dots, m\}$, $a_i^o$ is the value
of attribute $A_i$ for object o. For each object $o \in O$, we shall further assume that
the values $a_1^o, \dots, a_m^o$ are given on the same scale, which implies that all the
attributes are commensurate.

In numerous situations, it is useful to be able to associate with each object
$o \in O$ one unique value resulting from the merging of the values $a_1^o, \dots, a_m^o$. In
order to perform this step, an aggregation operator is to be used; see e.g. [2, 3].

In the presence of independent attributes, one of the most frequently used
aggregation operators is the weighted arithmetic mean. The unique value $W_\omega(a)$
associated with a profile $a = (a_1, \dots, a_m)$ is then given by $W_\omega(a) := \sum_{i=1}^m \omega_i a_i$
where, for any $i \in \{1, \dots, m\}$, $\omega_i$ is the weight of attribute $A_i$ with $\omega_i \ge 0$ and
$\sum_{i=1}^m \omega_i = 1$.

The assumption of independence among attributes is however rarely verified.
In order to be able to model interaction phenomena among attributes, it has been
recently proposed to replace the weight vector $\omega$ by a monotone set function
$\mu$ on $M := \{A_1, \dots, A_m\}$, which makes it possible to model not only the
importance of each attribute but also the importance of each subset of attributes;
see e.g. [4, 5]. Such a set function $\mu$, called (discrete) Choquet capacity [6] or
fuzzy measure [7], satisfies $\mu(\emptyset) = 0$, $\mu(M) = 1$ and $\mu(S) \le \mu(T)$
whenever $S \subseteq T \subseteq M$.

A suitable aggregation operator that generalizes the weighted arithmetic
mean when the attributes interact is then the discrete Choquet integral with
respect to the capacity $\mu$; see e.g. [4, 5, 8].



V. Torra and Y. Narukawa (Eds.): MDAI 2004, LNAI 3131, pp. 163-175, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 



164 Ivan Kojadinovic 



The use of a weighted arithmetic mean (resp. Choquet integral) as an aggregation
operator first requires the definition of a weight vector $\omega$ (resp. a capacity
$\mu$). When the aggregation is to be carried out in the context of multicriteria
decision making, additional information on the underlying problem given by a
decision maker can be used to determine $\omega$ or $\mu$. This additional knowledge, called
initial preferences by Marchant [9], generally consists of preferences over the
objects, intuitions about the importance of the attributes and the relationships
among them, etc. In such a context, the aim is to model the preferences of the
decision maker by means of the aggregation function. When the Choquet integral
is to be used as aggregation operator, there are several approaches that can be
used to identify the capacity. More details can be found e.g. in [10, 11, 12, 13, 14].
In the sequel, we shall term these identification methods supervised.

The term supervised refers to the fact that some prior knowledge (initial
preferences) has to be provided in order to fully determine the aggregation operator.
The following question then naturally arises: what if the required knowledge
cannot be easily given or, simply, is not available?

With these considerations in mind, an unsupervised identification method
of the capacity based on the estimation of the interaction among attributes by
means of information-theoretic functionals was proposed by the author in [1].
The suggested approach mainly consists in replacing the subjective notion of
importance of a subset of attributes by the probabilistic notion of information
content of a subset of attributes, which can be estimated from the set of profiles.
Although it clearly does not pertain to the field of multicriteria decision making
since it is unsupervised, the proposed identification method could still be
considered as an alternative to the existing approaches developed in
[10, 11, 12, 13, 14] when the prior knowledge they rely on cannot be provided.
From a practical perspective, a sufficiently large number of profiles is necessary
to obtain accurate estimates of the capacity coefficients and therefore of the
Choquet integral.

This paper is organized as follows. First, the Choquet integral is briefly
presented in the framework of aggregation, and numerical indices that can be used
to interpret its behavior are given. Then, the unsupervised identification method
proposed in [1] is summarized. Finally, the suggested approach is applied to the
evaluation of 89 first year students in Mathematics and Physics from the University
of Reunion Island.



2 Aggregation by the Choquet Integral 

In the framework of aggregation, the Choquet integral can be regarded as an
extension of the weighted arithmetic mean taking into account the interaction
phenomena among attributes. The expression “interaction phenomena” refers to
complementarity or redundancy effects among attributes that can be modeled
by the underlying capacity $\mu$; see e.g. [5]. The modelling of the interactions
among attributes is obtained by defining, for each non empty subset of attributes
$S \subseteq M$, its weight or its importance $\mu(S)$. The Choquet integral generalizes the
weighted arithmetic mean in the sense that, as soon as the capacity is additive
(which intuitively coincides with the independence of the attributes), it collapses
into a weighted arithmetic mean. More details can be found e.g. in [4, 5].

The Choquet integral of a function $a : M \to \mathbb{R}$ represented by the profile
$(a_1, \dots, a_m)$ with respect to a capacity $\mu$ on $M$ can be defined by

$$C_\mu(a) := \sum_{i=1}^m a_{(i)} \left[ \mu(S_{(i)}) - \mu(S_{(i+1)}) \right],$$

where the notation $(\cdot)$ indicates a permutation such that $a_{(1)} \le \dots \le a_{(m)}$,
$S_{(i)} := \{A_{(i)}, \dots, A_{(m)}\}$ for all $i \in \{1, \dots, m\}$, and $S_{(m+1)} := \emptyset$.

An intuitive presentation of the Choquet integral can be found in [15]. Note 
that an axiomatic characterization of the Choquet integral as an aggregation 
operator has been recently proposed by Marichal [5]. 
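
As a minimal sketch of the definition above (not code from the paper), the following Python function computes the discrete Choquet integral of a profile. The representation of a capacity as a dictionary keyed by frozensets of 0-based attribute indices, and the helper `additive_capacity`, are assumptions made purely for illustration.

```python
from itertools import combinations


def choquet_integral(profile, capacity):
    """Discrete Choquet integral of `profile` with respect to `capacity`.

    profile  : list of m numbers (a_1, ..., a_m)
    capacity : dict mapping each frozenset of attribute indices (0-based)
               to its weight, with weight 0 for the empty set and 1 for
               the full set (assumed representation).
    """
    m = len(profile)
    # Permutation that sorts the partial values in non-decreasing order.
    order = sorted(range(m), key=lambda i: profile[i])
    total = 0.0
    for rank, i in enumerate(order):
        upper = frozenset(order[rank:])            # S_(i) = {A_(i), ..., A_(m)}
        upper_next = frozenset(order[rank + 1:])   # S_(i+1), empty at the last step
        total += profile[i] * (capacity[upper] - capacity[upper_next])
    return total


def additive_capacity(weights):
    """Hypothetical helper: the additive capacity induced by a weight vector."""
    m = len(weights)
    return {
        frozenset(s): sum(weights[i] for i in s)
        for r in range(m + 1)
        for s in combinations(range(m), r)
    }
```

With an additive capacity the result coincides with the weighted arithmetic mean, in line with the remark that the Choquet integral collapses into a weighted mean when the attributes are independent.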

3 Behavioral Analysis of the Aggregation 

The behavior of the Choquet integral as an aggregation operator is generally 
difficult to understand. For a better comprehension of the interaction phenomena 
modeled by the underlying capacity, several numerical indices can be computed. 
In the sequel, we mention three of them. More details can be found e.g. in [16]. 

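The global importance of an attribute $A_i$ can be measured by means of its
Shapley value [17], which is given by

$$\phi_\mu(A_i) := \sum_{T \subseteq M \setminus \{A_i\}} \frac{(m-|T|-1)!\,|T|!}{m!} \left[ \mu(T \cup \{A_i\}) - \mu(T) \right].$$

The average interaction between two attributes $A_i$ and $A_j$ can be measured
by means of their Shapley interaction index [18, 19] which can be computed by

$$I_\mu(A_i, A_j) := \sum_{T \subseteq M \setminus \{A_i, A_j\}} \frac{(m-|T|-2)!\,|T|!}{(m-1)!} \left[ \mu(T \cup \{A_i, A_j\}) - \mu(T \cup \{A_i\}) - \mu(T \cup \{A_j\}) + \mu(T) \right].$$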
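
The two indices above can be sketched in Python as follows. This is an illustrative implementation under assumptions that are not part of the paper: the capacity is stored as a dictionary keyed by frozensets of 0-based attribute indices, and the function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for s in combinations(items, r):
            yield frozenset(s)


def shapley_value(capacity, m, i):
    """Shapley value (global importance) of attribute i among m attributes."""
    others = [k for k in range(m) if k != i]
    total = 0.0
    for t in subsets(others):
        coeff = factorial(m - len(t) - 1) * factorial(len(t)) / factorial(m)
        total += coeff * (capacity[t | {i}] - capacity[t])
    return total


def interaction_index(capacity, m, i, j):
    """Shapley interaction index between attributes i and j."""
    others = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for t in subsets(others):
        coeff = factorial(m - len(t) - 2) * factorial(len(t)) / factorial(m - 1)
        total += coeff * (
            capacity[t | {i, j}] - capacity[t | {i}] - capacity[t | {j}] + capacity[t]
        )
    return total
```

For an additive capacity the interaction indices vanish and the Shapley values reduce to the attribute weights; negative interaction indices indicate redundancy between the two attributes.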

The last index we shall mention is the Marichal entropy of a capacity [20],
which can be used to measure the average degree of utilization of a profile during
the aggregation by the Choquet integral with respect to $\mu$ [21]. Denoted $H_M(\mu)$,
it is defined by

$$H_M(\mu) := - \sum_{i=1}^m \sum_{T \subseteq M \setminus \{A_i\}} \frac{(m-|T|-1)!\,|T|!}{m!} \left[ \mu(T \cup \{A_i\}) - \mu(T) \right] \log \left[ \mu(T \cup \{A_i\}) - \mu(T) \right].$$

4 A Probabilistic View of the Identification Problem



In the absence of initial preferences, the only available information is the set of
profiles. In such an unsupervised context, as mentioned in the introduction, the
problem of the identification of the capacity can be regarded as an estimation
problem. Hence, with each attribute $A_i$ is uniquely associated a random variable
$X_i$ such that, for any object $o \in O$, the value $a_i^o$ is seen as a realization
of $X_i$ and, consequently, every profile $a^o = (a_1^o, \dots, a_m^o)$ is seen as a realization
of the random vector $(X_1, \dots, X_m)$.

In [1], it was suggested to define the weights of the non empty subsets of
attributes by means of an entropy measure, that is, to replace the subjective
notion of importance by the probabilistic notion of information content. In order
to ensure that the resulting set function is monotone, it was further assumed
that the random variables $X_1, \dots, X_m$ are discrete and take their values in the
finite sets $\mathcal{X}_1, \dots, \mathcal{X}_m$; see also [22]. From a practical perspective, as we shall
see in Subsection 5.2, this may require a prior discretization of the available
profiles before the estimation of the capacity. The trivial situation where the
joint probability distribution $p_{(X_1,\dots,X_m)}$ of $X_1, \dots, X_m$ is a Dirac mass was also
excluded. Indeed, in this case, all the profiles would be necessarily equal and
further aggregation would make little sense. The weight of every subset $S \subseteq M$
of attributes was then defined by

$$\mu(S) := \begin{cases} 0, & \text{if } S = \emptyset, \\[4pt] \dfrac{H(p_{(X_{i_1},\dots,X_{i_r})})}{H(p_{(X_1,\dots,X_m)})}, & \text{if } S = \{A_{i_1}, \dots, A_{i_r}\}, \end{cases}$$

where $H$ is an entropy measure [23]. Notice that, from the basic properties of
entropy measures, $H(p_{(X_1,\dots,X_m)}) \neq 0$ since it was assumed that $p_{(X_1,\dots,X_m)}$ is
not a Dirac mass [24].

It is easy to verify that $\mu$, and therefore the Choquet integral with respect
to $\mu$, depend only on the joint distribution $p_{(X_1,\dots,X_m)}$ of $X_1, \dots, X_m$ [1]. The
distribution $p_{(X_1,\dots,X_m)}$ can in turn be estimated from the available profiles, which
may have to be discretized if the attributes cannot be considered discrete.

More technically, given a random sample $Y_1, \dots, Y_n$ of $(X_1, \dots, X_m)$, a natural
estimator of the weight of a subset $S = \{A_{i_1}, \dots, A_{i_r}\} \subseteq M$ is given by

$$\hat{\mu}(S) := \frac{H(\hat{p}_{(X_{i_1},\dots,X_{i_r})})}{H(\hat{p}_{(X_1,\dots,X_m)})}, \qquad (1)$$

where $\hat{p}_{(X_1,\dots,X_m)}$ is the classical maximum likelihood estimator of $p_{(X_1,\dots,X_m)}$
defined, for any $(x_1, \dots, x_m) \in \mathcal{X}_1 \times \dots \times \mathcal{X}_m$, by

$$\hat{p}_{(X_1,\dots,X_m)}(x_1, \dots, x_m) := \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{Y_i = (x_1,\dots,x_m)\}},$$

where $\mathbf{1}_{\{Y_i = (x_1,\dots,x_m)\}}$ is the indicator function of the event
$\{Y_i = (x_1, \dots, x_m)\}$. It is easy to verify that, for any
$S = \{A_{i_1}, \dots, A_{i_r}\} \subseteq M$, the estimator $\hat{p}_{(X_{i_1},\dots,X_{i_r})}$ of
$p_{(X_{i_1},\dots,X_{i_r})}$ can be immediately obtained from $\hat{p}_{(X_1,\dots,X_m)}$.

A natural estimator of the Choquet integral of a profile $a = (a_1, \dots, a_m)$
with respect to $\mu$ is then simply obtained by substitution, i.e. $\hat{C}_\mu(a) = C_{\hat{\mu}}(a)$.
The asymptotic properties of $\hat{C}_\mu(a)$ were studied in [1].

Although many (parametric) entropy measures could be considered, in the
sequel, $H$ is chosen to be the well-known Shannon entropy [25] which comes
down to measuring the interactions among the random variables $X_1, \dots, X_m$
using the notion of mutual information [1, 23]. It is then easy to verify that $\hat{\mu}$ is
a submodular capacity [22] which implies that $\hat{\mu}$ can only model negative
interactions (or redundancy effects) between two attributes. Note that such a behavior
is natural in an unsupervised context, since, in order to detect positive
interactions (or complementarity effects) between two attributes, initial preferences
would be necessary.

Before ending this section, let us give an interpretation of $\hat{\mu}$. By considering
Eq. (1), we can see that the weight of a non empty subset of attributes directly
depends on the uniformity of the corresponding estimated probability distribution:
roughly speaking, the more discriminative among the alternatives a subset
of attributes is, the more uniform the corresponding estimated probability
distribution, the higher its weight, and reciprocally.
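
The estimation described in this section can be sketched as follows, assuming Shannon entropy and profiles that are already discrete. The function names and the dictionary representation of the capacity (frozensets of 0-based attribute indices) are illustrative assumptions, not the author's implementation.

```python
from collections import Counter
from itertools import combinations
from math import log


def shannon_entropy(counts, n):
    """Shannon entropy of the empirical distribution given by `counts` over n samples."""
    return -sum((c / n) * log(c / n) for c in counts.values())


def estimate_capacity(profiles):
    """Estimate the weight of every subset of attributes as in Eq. (1).

    profiles : list of equal-length tuples of discrete values, one per object.
    Returns a dict mapping frozensets of attribute indices to weights.
    """
    n = len(profiles)
    m = len(profiles[0])
    joint = shannon_entropy(Counter(profiles), n)
    if joint == 0:
        # All profiles are equal (Dirac mass): the ratio in Eq. (1) is undefined.
        raise ValueError("all profiles are identical; the capacity is undefined")
    capacity = {frozenset(): 0.0}
    for r in range(1, m + 1):
        for s in combinations(range(m), r):
            # Empirical marginal distribution of the attributes in s.
            marginal = Counter(tuple(p[i] for i in s) for p in profiles)
            capacity[frozenset(s)] = shannon_entropy(marginal, n) / joint
    return capacity
```

Two independent, uniformly distributed binary attributes each receive weight 0.5, whereas two perfectly redundant attributes each receive weight 1, so that the singleton weights sum to more than 1.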



5 Application to the Evaluation of Students 

In order to illustrate the application of the proposed identification method, let
us consider the classical problem which consists in assigning global evaluations
to students from their partial evaluations in different subjects. However, we shall
here further assume that no initial preferences are available. The data correspond
to marks of 89 first year students in Mathematics and Physics from the University
of Reunion Island for five subjects: English (Eng), Computer Science (Com),
Algebra (Alg), Analysis (Ana) and Physics (Phy). The marks on a 0 to 20 scale
are given in Table 1. For each subject, the minimum, maximum, average mark
and the standard deviation of the marks are given in Table 2 as well as the linear
correlation matrix of the data.

The aim is to estimate the weights of the subsets of attributes (i.e. subjects)
from the available profiles by means of the capacity $\hat{\mu}$ previously defined and
to compute the global evaluation of each student from his/her partial evaluations
by means of the Choquet integral with respect to $\hat{\mu}$. In the absence of initial
preferences, the most natural aggregation operator for this task would be the
simple arithmetic mean. Thus, in a second stage, the behavior of the Choquet
integral will be compared with that of the simple arithmetic mean.



Unsupervised Aggregation by the Choquet Integral 169 



Table 2. Statistical summary of the available marks and correlation matrix

                     Eng   Com   Alg   Ana   Phy
Minimum              1.0   1.0   0.2   0.2   1.1
Maximum             18.6  18.1  14.5  10.0  13.1
Average             10.6   8.6   6.1   5.1   6.1
Standard deviation   4.5   4.4   3.4   2.2   2.8

       Com   Alg   Ana   Phy
Eng   0.15  0.32  0.17  0.18
Com         0.30  0.12  0.15
Alg               0.46  0.35
Ana                     0.15


5.1 The Problem of the Non-commensurateness 
of the Partial Evaluations 

Before estimating $\mu$ and the global evaluations of the students, it is fundamental
to see whether the partial evaluations given in Table 1 can be considered as
commensurate or not. The summary statistics given in Table 2 show that the
marks in Mathematics and Physics are much lower on average than the marks
in the other subjects. Given the rather large number of students, this suggests
that Mathematics and Physics are marked much more severely than the other
subjects and thus that the partial evaluations are not commensurate. In order to
solve this problem, we state the following hypothesis: the 89 considered students
form a representative sample of the student population. Under this hypothesis, it
seems reasonable to consider that the available sample contains both very good
and very bad students. We then suggest linearly transforming the available data
such that, for each subject, the lowest mark is 0 and the highest is 1. Although
this may not be completely satisfactory, we shall assume in the sequel that the
resulting partial evaluations are commensurate.
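
The linear transformation described above amounts to a per-subject min-max rescaling. A minimal sketch follows; the function name and the row-per-student data layout are assumptions, and the sketch presumes that each subject has at least two distinct marks.

```python
def rescale_columns(marks):
    """Linearly map each column (subject) of `marks` onto [0, 1].

    marks : list of equal-length tuples, one row of raw marks per student.
    For each subject, the lowest mark becomes 0 and the highest becomes 1.
    Assumes max > min in every column (otherwise a division by zero occurs).
    """
    columns = list(zip(*marks))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs))
        for row in marks
    ]
```

For instance, with the extremes of Table 2, an English mark of 9.8 on the range [1.0, 18.6] is mapped to 0.5.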

5.2 Estimation of the Weights of the Subsets of Attributes 

In order to be able to estimate $\mu$, here, it is necessary to first discretize the
available profiles. Given the rather low number of profiles with respect to the
dimension of the problem, we decide to first divide the domain of each attribute,
i.e. the interval [0,1], in d = 6 classes: [0,1/6[, [1/6,2/6[, [2/6,3/6[, [3/6,4/6[,
[4/6,5/6[, and [5/6,1]. This is equivalent to considering that the associated
discrete random variables can take only six different values. The influence of the
parameter d on the estimation of the weights and on the Choquet integral will
be studied in Subsection 5.5.
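
The discretization into d equal-width classes can be sketched as below. The function name is hypothetical, and values lying exactly on an interior class boundary may be affected by floating-point rounding.

```python
def discretize(value, d):
    """Return the index (0 .. d-1) of the class of [0, 1] that contains `value`.

    The classes are [k/d, (k+1)/d[ for k < d-1, and [(d-1)/d, 1] for the last one.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    # int() truncates towards zero; min() puts the value 1.0 in the last class.
    return min(int(value * d), d - 1)
```

Applying `discretize` to every rescaled mark yields the discrete profiles from which the capacity is estimated.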

Estimates of the weights of subsets can then be obtained using Eq. (1).
Notice that, because of the way $\hat{\mu}$ was defined, the weight of a non empty subset
of subjects directly depends on the uniformity of the distribution of the marks for
these subjects. In order to explain this point in more detail, consider the case of
a subset reduced to a single subject. If most of the students have a similar mark
for the considered subject, the weight of the subject will be low, which could be
justified by the fact that it does not clearly discriminate between good and bad
students. On the contrary, the more uniformly the marks are distributed, the
higher the weight of the subject. The same reasoning can be applied to subsets
of more than one subject.

Table 3. Estimated weights of the subjects

 Eng   Com   Alg   Ana   Phy
0.38  0.38  0.38  0.38  0.36

Table 4. Shapley importance indices and Shapley interaction indices between subjects

 Eng   Com   Alg   Ana   Phy
0.21  0.22  0.19  0.19  0.19

       Com    Alg    Ana    Phy
Eng  -0.09  -0.08  -0.08  -0.08
Com         -0.09  -0.07  -0.09
Alg                -0.12  -0.07
Ana                       -0.09



The estimated weight of each subject is given in Table 3. As one could have
expected from the submodularity of $\hat{\mu}$, the sum of the weights of the subjects is
(much) higher than 1, which indicates redundancy among subjects; see [22, 26]
for more details.

5.3 Behavioral Analysis of the Choquet Integral with Respect to $\hat{\mu}$

In order to study the behavior of the Choquet integral with respect to $\hat{\mu}$, the Shapley
importance indices of each subject were computed (cf. Section 3). These indices
are given in Table 4. As one can notice, all the subjects have approximately the
same global importance.

The average interactions between subjects can be evaluated by computing
their Shapley interaction indices (cf. Section 3). These indices are given in
Table 4. Again, as one could have expected from the submodularity of $\hat{\mu}$, the
interaction indices are all negative; see e.g. [19]. By considering Table 4, one can
see that the two subjects that interact the most (negatively) are Analysis and
Algebra. This redundancy effect implies that a high mark (resp. low) in Analysis
is usually accompanied by a high mark (resp. low) in Algebra and vice versa.

Finally, the behavior of the Choquet integral with respect to the capacity
$\hat{\mu}$ can also be interpreted by means of the Marichal entropy of $\hat{\mu}$. As discussed
in [21], the quantity $H_M(\hat{\mu})$ can be seen as a measure of the average degree
of utilization of a profile during the aggregation. The higher $H_M(\hat{\mu})$, the closer
the behavior of the Choquet integral to the simple arithmetic mean. On the
contrary, the more disjunctive or conjunctive the Choquet integral is (i.e. close
to the maximum or minimum resp.), the lower $H_M(\hat{\mu})$.

In order to have an index in [0, 1], $H_M$ can be simply normalized by division
by its maximum (log 5 for the considered problem). We obtain $H_M(\hat{\mu}) / \log 5 =
0.84$, which could be considered as satisfactory. More details can be found in [16].
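
A sketch of the normalized Marichal entropy used above is given below. It follows the standard definition of the entropy of a capacity, with the capacity stored as a dictionary keyed by frozensets of 0-based attribute indices; the data layout and function names are illustrative assumptions.

```python
from itertools import combinations
from math import factorial, log


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for s in combinations(items, r):
            yield frozenset(s)


def marichal_entropy(capacity, m):
    """Marichal entropy of a capacity defined on m attributes."""
    total = 0.0
    for i in range(m):
        others = [k for k in range(m) if k != i]
        for t in subsets(others):
            coeff = factorial(m - len(t) - 1) * factorial(len(t)) / factorial(m)
            delta = capacity[t | {i}] - capacity[t]
            if delta > 0:  # convention: 0 * log 0 = 0
                total -= coeff * delta * log(delta)
    return total


def normalized_entropy(capacity, m):
    """Entropy divided by its maximum log(m), so that the result lies in [0, 1]."""
    return marichal_entropy(capacity, m) / log(m)
```

The normalized index equals 1 for the capacity of the simple arithmetic mean and 0 for the capacities corresponding to the maximum or the minimum.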

5.4 Estimation of the Global Evaluations 

Once the weights of non empty subsets of attributes are estimated, the global
evaluations of the students can be computed by means of the Choquet integral
with respect to $\hat{\mu}$. These global evaluations are given in Figure 1 (dashed line).
The continuous line corresponds to the global evaluations obtained by the simple
arithmetic mean.

Fig. 1. Global evaluations computed by the Choquet integral with respect to $\hat{\mu}$
(dashed line) and the simple arithmetic mean (continuous line); horizontal axis: students

Table 5. Profile of the student designated best by the simple arithmetic mean (m)
and profile of the student designated best by the Choquet integral (c)

     Eng   Com   Alg   Ana   Phy
m   0.85  0.84  0.98  0.85  0.53
c   0.83  0.88  1.0   0.09  1.0

By considering Figure 1, one can notice that the global evaluations computed
by the Choquet integral with respect to $\hat{\mu}$ are always higher than the simple
arithmetic mean of the marks. This disjunctive behavior of the Choquet integral
is due to the strong redundancy among subjects modeled by $\hat{\mu}$.

In order to study the effects of the negative interaction phenomena among
attributes modeled by $\hat{\mu}$, we compare the profile of the student designated best
by the simple arithmetic mean with that designated best by the Choquet integral
with respect to $\hat{\mu}$. By observing Figure 1, it appears that, in the first case, the
best student is student number 55 in Table 1 and that, in the second case, the
best student is student number 41. In the sequel, the former will be called m,
the latter c. Their profiles are given in Table 5.
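
As a small worked check of the comparison, the simple arithmetic means of the two profiles of Table 5 can be recomputed directly. The variable and function names below are illustrative only.

```python
# Rescaled marks (Eng, Com, Alg, Ana, Phy) of the two students, from Table 5.
profile_m = [0.85, 0.84, 0.98, 0.85, 0.53]
profile_c = [0.83, 0.88, 1.0, 0.09, 1.0]


def arithmetic_mean(profile):
    """Simple (unweighted) arithmetic mean of the partial evaluations."""
    return sum(profile) / len(profile)


mean_m = arithmetic_mean(profile_m)
mean_c = arithmetic_mean(profile_c)
```

The means are about 0.81 for m and 0.76 for c, so the arithmetic mean ranks m first, while c has the higher best marks (1.0 against 0.98), which is what a disjunctive operator rewards.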

By observing Table 5, one can see that student m has good results on average
but that the marks of student c are globally superior, except in Analysis where
her/his mark is extremely low. The fact that student c is designated better
than student m by the Choquet integral can be explained by c’s high mark
in Algebra and the disjunctive behavior of the Choquet integral due, amongst
other things, to the strong negative interaction between Analysis and Algebra.
In other terms, a high mark in Algebra or in Analysis is sufficient to significantly
influence the global evaluation. In more academic terms, the extremely low mark
of c in Analysis is interpreted as an “accident” in comparison to c’s other marks
and especially her/his mark in Algebra.

Fig. 2. Influence of parameter d on the aggregation by the Choquet integral with
respect to $\hat{\mu}$

To conclude this subsection, we could say that, globally, the simple arithmetic
mean can be considered as underestimating the students since it does not take
into account the redundancy effects among subjects.

5.5 Influence of the Parameter d on the Aggregation 

Before ending our study, we briefly present the results obtained for other
values of the discretization parameter d. Recall that d corresponds to the
number of subdivisions of [0,1]. We have therefore estimated the capacity $\hat{\mu}$ and
computed the global evaluations of the 89 students for d = 2, 4, 6, 8 and 10.
The obtained results show that the larger d, the more disjunctive the Choquet
integral. In order to illustrate this behavior, the global evaluations computed by
the Choquet integral with respect to $\hat{\mu}$ for d = 2 and d = 10 are compared with
the simple arithmetic mean of the marks. In Figure 2 (a), the global evaluations
computed by the Choquet integral for d = 2 (dashed line) are compared with
the simple arithmetic mean of the marks (continuous line). As one can notice,
the two curves are almost superimposed. In our opinion, this is due to the fact
that the low value of parameter d does not make it possible to highlight the
dependencies among subjects. Figure 2 (b) shows a similar comparison for d = 10. In
this case, the redundancy effects among subjects seem to have been taken into
account since the Choquet integral with respect to $\hat{\mu}$ shows a highly disjunctive
behavior. This observation is strengthened by the evolution of the values of
the Shapley interaction indices between subjects and of the Marichal entropy
of $\hat{\mu}$ against parameter d, as can be noticed from Figures 2 (c) and 2 (d)
respectively. Indeed, the higher d, the higher the redundancy between subjects
and the lower $H_M(\hat{\mu})$. However, the higher d, the larger the sample size
necessary to obtain accurate estimates of $p_{(X_1,\dots,X_m)}$. Indeed, like all multivariate
statistical methods, the proposed approach suffers from the so-called curse of
dimensionality. Hence, spurious redundancy effects could appear as d increases.
This phenomenon is illustrated in Figure 2 (c): the higher d, the higher and more
homogeneous the Shapley interaction indices.

6 Conclusion 

A practical application of the unsupervised Choquet integral based aggregation
method proposed in [1] has been presented. In the absence of initial preferences,
the suggested methodology could be considered more insightful than arbitrarily
parametrized weighted arithmetic mean based approaches that cannot take
redundancy effects among attributes into account. From a practical point of view,
the proposed methodology might be used in several fields where information
fusion is necessary [3]. Proceeding like Marichal et al. [27], it could also be used in
an ordinal context. From an operational perspective, a sufficiently large number
of profiles is necessary to obtain accurate estimates of the capacity coefficients
and therefore of the Choquet integral. Furthermore, in the case of continuous
attributes, a strategy for the choice of the discretization parameter d would need
to be investigated.

Abstract. Distortion of fuzzy measures is discussed. Special attention
is paid to the preservation of submodularity and supermodularity, belief
and plausibility.



1 Introduction 

Modelling of uncertainty by means of probability measures is a powerful tool in
many engineering, economical, medical, sociological, etc., applications. Though
the success of probability theory is undoubted, it has several limitations. One
of them is hidden in the genuine property of probability measures, namely
additivity. Additivity of a set function excludes the modelling of interaction among
subsets (singletons) present in many real situations. One of the first attempts to
increase the modelling potential of probability theory is due to Medolaghi
in 1907 [10], who proposed a distortion of probability measures replacing the
additivity $m(A \cup B) = m(A) + m(B)$ (for disjoint events $A$, $B$) by the $\lambda$-additivity
$m(A \cup B) = m(A) + m(B) + \lambda m(A) m(B)$ (still for disjoint events $A$, $B$). Note
that these set functions were reintroduced and applied by Sugeno in 1974 [16]
under the name $\lambda$-measures, $\lambda \in \,]-1, +\infty[$. Later generalizations of probability
measures, such as submeasures or supermeasures, were investigated by many
authors. For an overview and state of the art of modern measure theory we
recommend the handbook [13]. Note only that distorted probabilities (i.e. set
functions $f(P)$ with $f$ a non-decreasing $[0, 1] \to [0, 1]$ mapping, $f(0) = 0$, $f(1) = 1$)
are commonly exploited in the game theory built by Aumann and Shapley [1].
Another generalization of probability theory is the Dempster-Shafer theory
of evidence [6], [14] working with belief and plausibility measures. Recall that
these set functions occur already in the work of De Finetti [5] as lower and
upper probabilities when extending a fixed probability measure from a (finite)
subalgebra to a given $\sigma$-algebra.

The aim of this paper is the discussion of distortions $f$ preserving some of
the distinguished generalizations of the probability measures. Note that it is
evident that the only distortion $f$ such that $f(P)$ is a probability measure for any
probability measure $P$ is the identity, i.e., $f(x) = x$ for all $x \in [0, 1]$. The paper is
organized as follows. In the next section, we recall a general framework of fuzzy
measures (as the largest class of set functions generalizing probability measures
we will deal with). Distinguished subclasses of fuzzy measures will be
characterized by their respective properties. Distortions preserving these subclasses will
be discussed in the subsequent sections. Finally, some conclusions are included.

2 Fuzzy Measures 

Fuzzy measures play a key role in interactive decision making processing. For 
applications, several distinguished classes of fuzzy measures are important. Some 
of them are discussed and characterized, e.g., in [4], [11]. We will pay our at- 
tention to the classes of fuzzy measures characterized by some special property. 



Definition 1. Let $X = \{1, \dots, n\}$, $n \in \mathbb{N}$, be a fixed universe (set of
criteria). A mapping $m : \mathcal{P}(X) \to [0, 1]$ is called a fuzzy measure whenever
$m(\emptyset) = 0$, $m(X) = 1$ and for all $A \subseteq B \subseteq X$ it holds $m(A) \le m(B)$.

Distinguished classes of fuzzy measures are determined by their respective 
properties. 

Definition 2. A fuzzy measure $m$ on $X$ is called:

1. Subadditive whenever

$\forall A, B \in \mathcal{P}(X)$, $m(A \cup B) \le m(A) + m(B)$,

2. Superadditive whenever

$\forall A, B \in \mathcal{P}(X)$, $A \cap B = \emptyset$, $m(A \cup B) \ge m(A) + m(B)$,

3. Submodular whenever

$\forall A, B \in \mathcal{P}(X)$, $m(A \cup B) + m(A \cap B) \le m(A) + m(B)$,

4. Supermodular whenever

$\forall A, B \in \mathcal{P}(X)$, $m(A \cup B) + m(A \cap B) \ge m(A) + m(B)$,

5. Belief whenever

$$m(A_1 \cup \dots \cup A_n) \ge \sum_{i=1}^n m(A_i) - \sum_{i<j} m(A_i \cap A_j) + \dots + (-1)^{n+1} m(A_1 \cap \dots \cap A_n)$$

for arbitrary $n \in \mathbb{N}$ and $A_1, \dots, A_n \in \mathcal{P}(X)$,

6. Plausibility whenever

$$m(A_1 \cap \dots \cap A_n) \le \sum_{i=1}^n m(A_i) - \sum_{i<j} m(A_i \cup A_j) + \dots + (-1)^{n+1} m(A_1 \cup \dots \cup A_n)$$

for arbitrary $n \in \mathbb{N}$ and $A_1, \dots, A_n \in \mathcal{P}(X)$.

Observe that if a fuzzy measure is both submodular and supermodular, it is
modular and thus a probability measure on $X$. For more details about the above
properties (classes) of fuzzy measures we recommend [12], [13], [17]. Note also
that the relationship of the above properties is discussed in [15].

Evidently, each belief is supermodular and each supermodular fuzzy measure 
is also superadditive. Similarly, each plausibility is submodular, and each sub- 
modular fuzzy measure is subadditive. Some of the above properties are dual 
one to another. 

Definition 3. Let $m$ be a fuzzy measure on $X$. Its dual fuzzy measure $m^d$ is
given by $m^d(A) = 1 - m(X \setminus A)$, $A \in \mathcal{P}(X)$.

Now, it is evident that $(m^d)^d = m$ and that $m$ is submodular if and only
if $m^d$ is supermodular. Similarly, $m$ is belief if and only if $m^d$ is plausibility.
Note that in the measure theory [13], the subadditive fuzzy measures are also
called (normed) submeasures, while the superadditive fuzzy measures are called
(normed) supermeasures. However, subadditivity and superadditivity are
no longer dual properties, i.e., the dual of a subadditive fuzzy measure $m$ need
not be superadditive, and vice versa.
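
The modularity properties of Definition 2 and the duality of Definition 3 can be checked by brute force on a small universe. The sketch below is illustrative only: the representation of a fuzzy measure as a dictionary keyed by frozensets of the 0-based universe, and all function names, are assumptions.

```python
from itertools import combinations


def powerset(n):
    """All subsets of the universe {0, ..., n-1}, as frozensets."""
    return [frozenset(s) for r in range(n + 1) for s in combinations(range(n), r)]


def is_submodular(m, n, tol=1e-12):
    """Check m(A u B) + m(A n B) <= m(A) + m(B) for all pairs A, B."""
    sets = powerset(n)
    return all(m[a | b] + m[a & b] <= m[a] + m[b] + tol for a in sets for b in sets)


def is_supermodular(m, n, tol=1e-12):
    """Check m(A u B) + m(A n B) >= m(A) + m(B) for all pairs A, B."""
    sets = powerset(n)
    return all(m[a | b] + m[a & b] >= m[a] + m[b] - tol for a in sets for b in sets)


def dual(m, n):
    """Dual fuzzy measure: one minus the measure of the complement."""
    universe = frozenset(range(n))
    return {a: 1.0 - m[universe - a] for a in powerset(n)}
```

For example, the measure assigning $(|A|/3)^2$ to each subset $A$ of a three-element universe is supermodular but not submodular, and its dual is submodular.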

3 Sub- and Supermodular Fuzzy Measures 

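Let us denote by $\mathcal{F}$ the class of all functions $f : [0,1] \to [0,1]$ which are
non-decreasing and satisfy $f(0) = 0$, $f(1) = 1$. Similarly, let
$\mathcal{F}_i = \{f \in \mathcal{F} \mid$ if $m$ possesses property $i$ then also $f(m)$ fulfils $i\}$,
$i = 1, \dots, 6$. Any of the classes $\mathcal{F}, \mathcal{F}_1, \dots, \mathcal{F}_6$ is
a closed convex subset of $[0, 1]^{[0,1]}$. Moreover, it is evident that for any
$f, g \in \mathcal{F}, \mathcal{F}_1, \dots, \mathcal{F}_6$, respectively,
$f \circ g \in \mathcal{F}, \mathcal{F}_1, \dots, \mathcal{F}_6$, respectively.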

We start first with some trivial observations. 

Lemma 1. A function $f \in \mathcal{F}$ if and only if for any fuzzy measure $m$ on $X$,
$f(m)$ is also a fuzzy measure on $X$.

The subadditivity $f(m(A \cup B)) \le f(m(A)) + f(m(B))$ whenever $m(A \cup B) \le
m(A) + m(B)$ is equivalent to the subadditivity of $f \in \mathcal{F}$, i.e., $f(x + y) \le
f(x) + f(y)$ whenever $x, y, x + y \in [0, 1]$.

Theorem 1. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_1$ if and only if $f$ is subadditive.

Similarly we can show the next result.

Theorem 2. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_2$ if and only if $f$ is superadditive, i.e., if
$f(x + y) \ge f(x) + f(y)$ for all $x, y, x + y \in [0, 1]$.

Observe that the simultaneous subadditivity and superadditivity of a fuzzy
measure implies its additivity. This fact is reflected by the equality
$\mathcal{F}_1 \cap \mathcal{F}_2 = \{\mathrm{id}_{[0,1]}\}$, where $\mathrm{id}_{[0,1]} : [0, 1] \to [0, 1]$ is given by
$\mathrm{id}_{[0,1]}(x) = x$.

Now we turn our attention to the supermodular fuzzy measures. At first, we
need the next technical lemma.

Lemma 2. The following are equivalent for a non-decreasing function
$f : [0, 1] \to [0, 1]$:

1. $f$ is convex,

2. $\forall x, y, \lambda \in [0, 1]$: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$,

3. $\forall x, y \in [0, 1]$: $f\left(\frac{x+y}{2}\right) \le \frac{f(x) + f(y)}{2}$ (Jensen’s inequality),

4. $\forall x, y, \varepsilon \in [0, 1]$, $x \le y$, $x + \varepsilon, y + \varepsilon \in [0, 1]$:
$f(x + \varepsilon) - f(x) \le f(y + \varepsilon) - f(y)$.

From properties of supermodular fuzzy measures we immediately have the
next result.

Lemma 3. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_4$ if and only if

$\forall a, b, c, d \in [0, 1]$, $a \le b \le c \le d$, $a + d \ge b + c \Rightarrow f(a) + f(d) \ge f(b) + f(c)$.

Now we are able to characterize supermodularity preserving functions more
precisely.

Theorem 3. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_4$ if and only if $f$ is convex.

Proof.

(i) Let $f$ be a supermodularity preserving function. In Lemma 3, put
$b = c = \frac{a+d}{2}$, $0 \le a < d$. Then $f(a) + f(d) \ge f\left(\frac{a+d}{2}\right) + f\left(\frac{a+d}{2}\right)$, i.e.,
$f\left(\frac{a+d}{2}\right) \le \frac{f(a) + f(d)}{2}$. By Lemma 2 (Jensen’s inequality), $f$ is convex.

(ii) Suppose that for some $a, b, c, d \in [0, 1]$ with $a \le b \le c \le d$ and $a + d \ge b + c$ we have
$f(a) + f(d) < f(b) + f(c)$. Put $d^* = b + c - a \le d$. Then for $\varepsilon = b - a$ it holds
that $d^* - c = \varepsilon$, $a + d^* = b + c$, $f(a) + f(d^*) \le f(a) + f(d) < f(b) + f(c)$
and hence $f(c + \varepsilon) - f(c) < f(a + \varepsilon) - f(a)$, violating (4) of Lemma 2, and
then $f$ cannot be convex. □



Consequently we have the next characterization of submodularity preserving
functions.

Corollary 1. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_3$ if and only if $f$ is concave.

Proof. The result follows from the fact that $f$ preserves submodularity of fuzzy
measures if and only if its dual function $f^* : [0, 1] \to [0, 1]$ given by
$f^*(x) = 1 - f(1 - x)$ preserves the supermodularity. However, the convexity of $f^*$ is
equivalent to the concavity of $f$. □
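
Theorem 3 and Corollary 1 can be illustrated numerically on a small universe. The sketch below distorts a fuzzy measure with a convex and with a concave function and checks the modularity inequalities by brute force; the dictionary representation and the function names are assumptions, and such a finite check illustrates the statements without proving them.

```python
from itertools import combinations
from math import sqrt


def powerset(n):
    """All subsets of the universe {0, ..., n-1}, as frozensets."""
    return [frozenset(s) for r in range(n + 1) for s in combinations(range(n), r)]


def is_submodular(m, n, tol=1e-12):
    sets = powerset(n)
    return all(m[a | b] + m[a & b] <= m[a] + m[b] + tol for a in sets for b in sets)


def is_supermodular(m, n, tol=1e-12):
    sets = powerset(n)
    return all(m[a | b] + m[a & b] >= m[a] + m[b] - tol for a in sets for b in sets)


def distort(m, f):
    """Distorted fuzzy measure f(m): apply f to the measure of every subset."""
    return {a: f(value) for a, value in m.items()}


n = 4
uniform = {a: len(a) / n for a in powerset(n)}      # additive, hence modular
convex_image = distort(uniform, lambda x: x ** 2)   # convex distortion
concave_image = distort(uniform, sqrt)              # concave distortion
```

Here `convex_image` remains supermodular while `concave_image` is submodular but no longer supermodular, in agreement with the two characterizations.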

Note that $\mathcal{F}_3 \cap \mathcal{F}_4 = \{\mathrm{id}_{[0,1]}\}$, reflecting the fact that submodularity and
supermodularity of a fuzzy measure $m$ imply its additivity.

4 Belief and Plausibility Fuzzy Measures

Now we discuss the preservation of the belief property of $f(m)$.

Theorem 4. Let f G J- 5 - Then for > 0 for all x s]0, 1[. 

Proof. Suppose that there exist $k \in N$ and $x_0 \in ]0,1[$ such that $f^{(k)}(x_0) < 0$.

We will prove the existence of a fuzzy measure $m$ for which $f(m)$ is not belief.

Since the $k$-th derivative of $f$ is continuous, there exists an interval $]a,b[$ such
that $x_0 \in ]a,b[$ and $f^{(k)}(x) < 0$ for all $x \in ]a,b[$. We can choose rational
numbers $x_1, x_2$ such that $x_0 < x_1 < x_2 < b$. Due to [8] it holds

$\Delta_h^k f(x_1) < 0$ for arbitrary positive step $h$ with $kh + x_1 < b$. (1)

We notice that

$\Delta_h^k f(x_1) = \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} f(x_1 + ih)$

is the $k$-th difference of $f$ in $x_1$ with the step $h$.

Let $h = \frac{x_2 - x_1}{k}$; then $x_1, x_2, h$ are rational numbers and we can express them
by quotients with a joint denominator: $x_1 = \frac{p}{q}$, $x_2 = \frac{r}{q}$, $h = \frac{s}{q}$.

Let $X = \{1, 2, \dots, q\}$ and $m : \mathcal{P}(X) \to [0,1]$, $m(A) = \frac{|A|}{q}$ for all $A \in \mathcal{P}(X)$;
$m$ is obviously a belief fuzzy measure. In order for the composed set function
$\mu = f(m)$ to be belief, it must fulfil condition (5) in Definition 2.

Let

$A = \{1, 2, \dots, p + ks\}$,

$A_1 = A - \{1, 2, \dots, s\}$,

$A_2 = A - \{s + 1, \dots, 2s\}$,

$\vdots$

$A_k = A - \{(k-1)s + 1, \dots, ks\}$.

Obviously $\bigcup_{i=1}^{k} A_i = A$ and

$\mu(A) = f(m(A)) = f\left(\frac{p + ks}{q}\right) = f(x_1 + kh)$,

$\mu(A_i) = f(m(A_i)) = f\left(\frac{p + (k-1)s}{q}\right) = f(x_1 + (k-1)h)$, $i \in \{1, 2, \dots, k\}$,

$\mu(A_i \cap A_j) = f\left(\frac{|A_i \cap A_j|}{q}\right) = f\left(\frac{p + (k-2)s}{q}\right) = f(x_1 + (k-2)h)$, $i < j$, $i, j \in \{1, 2, \dots, k\}$,

and finally

$\mu\left(\bigcap_{i=1}^{k} A_i\right) = f\left(\frac{p}{q}\right) = f(x_1)$.

Condition (5) in Definition 2 for our sets $A_1, \dots, A_k \in \mathcal{P}(X)$ has the form

$f(x_1 + kh) - k\,f(x_1 + (k-1)h) + \binom{k}{2} f(x_1 + (k-2)h) - \dots + (-1)^k f(x_1) \ge 0$,

that is $\Delta_h^k f(x_1) \ge 0$, contradicting (1). □
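The $k$-th difference used in inequality (1) is easy to evaluate directly. The sketch below is illustrative only: the function name `kth_difference` and the sample distortion $f(x) = 2x - x^2$ are assumptions, not taken from the paper. This $f$ is non-decreasing on $[0,1]$ with $f(0) = 0$, $f(1) = 1$ and $f'' = -2 < 0$, so its second difference should be negative, which is the situation the proof exploits.

```python
from math import comb

def kth_difference(f, x, h, k):
    """k-th forward difference of f at x with step h:
    sum_{i=0}^{k} (-1)^(k-i) * C(k, i) * f(x + i*h)."""
    return sum((-1) ** (k - i) * comb(k, i) * f(x + i * h) for i in range(k + 1))

def f(x):
    # Hypothetical distortion with a negative second derivative on ]0, 1[.
    return 2 * x - x ** 2

second_difference = kth_difference(f, 0.2, 0.1, 2)
print(second_difference)  # negative; for this quadratic it equals -2 * h**2 up to rounding
```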

Note that the product of two belief fuzzy measures on $X$ is again a belief fuzzy
measure on $X$, see, e.g. [9]. Thus, by induction, $m^n$ is a belief fuzzy measure, i.e.,
the function $p_n : [0,1] \to [0,1]$ given by $p_n(x) = x^n$ preserves belief fuzzy
measures, $p_n \in \mathcal{F}_5$. Due to the convexity of $\mathcal{F}_5$, any polynomial
$f = \sum_{n=1}^{k} a_n p_n$, $a_n \ge 0$, $\sum_{n=1}^{k} a_n = 1$, preserves belief fuzzy measures.
Due to closedness of $\mathcal{F}_5$, any infinite series
$f = \sum_{n=1}^{\infty} a_n p_n$, $a_n \ge 0$, $\sum_{n=1}^{\infty} a_n = 1$, is also an element of $\mathcal{F}_5$.

Moreover, the minimal element $f_* \in \mathcal{F}$, given by $f_*(x) = 1$ if $x = 1$ and
$f_*(x) = 0$ otherwise, is also an element of $\mathcal{F}_5$.
Summarizing all the above facts, we have the next conjecture.

Conjecture 1. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_5$ if and only if $f = \lambda p + (1 - \lambda) f_*$ for
some $\lambda \in [0,1]$ and $p$ a finite or infinite polynomial with non-negative coefficients
such that $p \in \mathcal{F}$.
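The claim that $p_n(x) = x^n$ preserves belief measures can be spot-checked on the uniform measure $m(A) = |A|/q$. The sketch below relies on an assumption that is not the paper's Definition 2: it uses the standard equivalent characterization of belief measures by non-negative Möbius masses. For the uniform measure the Möbius mass of $f(m)$ at a set of size $j$ depends only on $j$. The function name `mobius_masses_uniform` and the choice $q = 4$ are hypothetical.

```python
from math import comb

def mobius_masses_uniform(f, q):
    """Möbius mass of the set function A -> f(|A| / q) at any set of size j,
    for j = 0..q: sum_{i=0}^{j} (-1)^(j-i) * C(j, i) * f(i / q)."""
    return [
        sum((-1) ** (j - i) * comb(j, i) * f(i / q) for i in range(j + 1))
        for j in range(q + 1)
    ]

square_masses = mobius_masses_uniform(lambda x: x ** 2, 4)    # p_2(x) = x^2
root_masses = mobius_masses_uniform(lambda x: x ** 0.5, 4)    # concave, not a p_n

print(all(mass >= -1e-12 for mass in square_masses))  # all masses non-negative
print(min(root_masses) < 0)                           # some mass is negative
```

A negative mass for the square-root distortion is consistent with Theorem 4, since its second derivative is negative on $]0,1[$.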



For the plausibility fuzzy measures we have the next result. 

Corollary 2. Let $f \in \mathcal{F}_6$. Then for all $k \in N$ it holds $(-1)^{k+1} f^{(k)}(x) \ge 0$ for
all $x \in ]0,1[$.

Proof. The result follows from the fact that $f$ preserves plausibility of fuzzy
measures if and only if its dual function $f^* : [0,1] \to [0,1]$ given by
$f^*(x) = 1 - f(1 - x)$ preserves the belief. Due to Theorem 4, it holds $(f^*)^{(k)}(x) \ge 0$
for all $k \in N$ and $x \in ]0,1[$. Since $(f^*)^{(k)}(x) = (-1)^{k+1} f^{(k)}(1 - x)$, the claim follows. □

Conjecture 2. Let $f \in \mathcal{F}$. Then $f \in \mathcal{F}_6$ if and only if $f = \lambda p + (1 - \lambda) f^*$ for
some $\lambda \in [0,1]$, $f^*$ the maximal element of $\mathcal{F}$ given by $f^*(x) = 0$ if $x = 0$ and
$f^*(x) = 1$ otherwise, and $p \in \mathcal{F}$ a polynomial with alternating signs of derivatives.



5 Conclusion 

We have characterized distortions preserving subadditive, superadditive,
submodular or supermodular fuzzy measures, respectively. We have described a large
subset of belief preserving distortions, conjecturing that our description is
complete (similarly for the preservation of plausibility). In our future research, we
aim to prove (or disprove) Conjectures 1 and 2. Moreover, any aggregation
$m = A(m_1, m_2)$ of fuzzy measures $m_1, m_2$ on $X$ by means of an aggregation
operator [2], [3] yields a fuzzy measure $m$ on $X$. Our next aim is the complete
characterization of aggregation operators preserving the distinguished classes of
fuzzy measures discussed above. Recall here that the only aggregation
operators preserving the class of probability measures are the convex sums (i.e.,
weighted means in the terminology of [2], [3]), which, indeed, preserve any of the
discussed classes of fuzzy measures.

Acknowledgement 

This work was supported by the Science and Technology Assistance Agency
under contract No. APVT-20-023402. Partial support of the grant VEGA
1/0273/03 is also gratefully acknowledged.

References 

[1] Aumann, R. J., Shapley, L. S.: Values of Non-Atomic Games. Princeton University
Press, Princeton (1974)

[2] Calvo, T., Kolesárová, A., Komorníková, M., Mesiar, R.: Aggregation Operators:
Properties, Classes and Construction Methods. In [3], 3-104

[3] Calvo, T., Mayor, G., Mesiar, R.: Aggregation Operators. Physica-Verlag,
Heidelberg (2002)

[4] Chateauneuf, A., Jaffray, J. Y.: Some characterizations of lower probabilities and
other monotone capacities through the use of Möbius inversion. Math. Social Sci.
17 (1989) 263-283

[5] De Finetti, B.: La prévision, ses lois logiques et ses sources subjectives. Ann. Inst.
H. Poincaré 7 (1937) 1-68

[6] Dempster, A. P.: Upper and lower probabilities induced by a multivalued mapping.
Ann. Math. Stat. 38 (1967) 325-339

[7] Denneberg, D.: Non-additive Measure and Integral. Kluwer Acad. Publ.,
Dordrecht (1994)

[8] Dzjadyk, V. K.: Vvedenie v teoriju ravnomernogo priblizenia funkcij polinomami.
Nauka, Moskva (1977)

[9] Kramosil, I.: Probabilistic Analysis of Belief Functions. IFSR Int. Series on
Systems Science and Engineering, Vol. 16, Kluwer Academic Publishers, New York
(2001)

[10] Medolaghi, P.: La logica matematica e il calcolo delle probabilità. Bollettino
dell'Associazione Italiana per l'Incremento delle Scienze Attuariali, vol. 18 (1907),
20-39

[11] Mesiar, R.: k-order additivity and maxitivity. Atti Sem. Math. Phys. Univ.
Modena 51 (2003) 179-189

[12] Pap, E.: Null-additive set functions. Ister Science Bratislava & Kluwer Academic
Publishers, Dordrecht-Boston-London (1995)

[13] Pap, E., ed.: Handbook on Measure Theory. Elsevier, Amsterdam (2002)

[14] Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)

[15] Stupnanova, A., Struk, P.: Pessimistic and optimistic fuzzy measures on finite
sets. MaGiA 2003, Bratislava, Slovakia, (2003) 94-100

[16] Sugeno, M.: Theory of Fuzzy Integrals and Applications. Ph.D. Thesis, Tokyo
Inst. of Technology, Tokyo (1974)

[17] Wang, Z., Klir, G.: Fuzzy measure theory. Plenum Press, New York and London
(1992)



Decision Modelling Using the Choquet Integral 



Yasuo Narukawa^1 and Toshiaki Murofushi^2

^1 Toho Gakuen, 3-1-10 Naka
narukawa@fz.dis.titech.ac.jp

^2 Department of Computational Intelligence and Systems Science
Tokyo Institute of Technology, 4259 Nagatuta,
Midori-ku, Yokohama 226, Japan
murofusi@fz.dis.titech.ac.jp

Abstract. The usefulness of the Choquet integral for modelling decision
under risk and uncertainty is shown. It is shown that some paradoxes
of expected utility theory are solved using the Choquet integral. It is also
shown that the Choquet expected utility model for decision under uncertainty
and the rank dependent utility model for decision under risk are each the
same as their simplified versions.

Keywords: Fuzzy measure, Non-additive measure, Choquet integral,
Decision under uncertainty, Decision under risk



1 Introduction 

In decision theory under risk and uncertainty, the expected utility theory
of von Neumann and Morgenstern [-5] is well known. However, in recent
years counterexamples showing that human decisions do not follow expected
utility theory have been reported in the literature. The Choquet integral with
respect to a non-additive set function, which goes by various names (e.g. fuzzy
measure, non-additive measure, capacity, non-additive subjective probability), is
a basic tool for modelling decisions under risk and uncertainty. We can explain
the famous paradoxes, that is, the Allais paradox [1] and Ellsberg's paradox [6], by
using the Choquet integral model. To explain the Allais paradox for decision under
risk, the rank dependent utility model was proposed by Quiggin [10]. For Ellsberg's
paradox, which concerns decision under uncertainty, the Choquet expected utility
model was proposed by Schmeidler [12]. Later, a simplified version in which
the utility function is not used was proposed by Chateauneuf [2]. Narukawa et
al. [8] categorize some functionals in the framework of decision making, using the
simplified version.

In this paper we show the usefulness of the Choquet integral with respect
to a fuzzy measure for modelling decision under risk and uncertainty. We define
the Choquet-Stieltjes integral and show that the Choquet-Stieltjes integral is
represented by a Choquet integral. As an application of this theorem, we show
that the Choquet expected utility model (resp. rank dependent utility model) is the
same as its simplified version. We show that the paradoxes of expected utility
theory are explained using the Choquet integral.



V. Torra and Y. Narukawa (Eds.): MDAI 2004, LNAI 3131, pp. 183-193, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

The structure of the paper is as follows. In Section 2, we define the fuzzy
measure and the Choquet integral and show their basic properties. We present the
Choquet integral representation theorem for comonotonically additive functionals.
In Section 3, we introduce two famous counterexamples to classical expected
utility models and define some non-expected utility models. In Section 4 we
define the Choquet-Stieltjes integral and show that the Choquet-Stieltjes integral
is represented by the Choquet integral with respect to another fuzzy measure. As
a corollary of the theorem, we show that the Choquet expected utility model
(resp. rank dependent utility model) is the same as its simplified version. In
Section 5, we show that the simplified version is sufficient to explain the paradoxes
mentioned above. We finish with concluding remarks in Section 6.

2 Fuzzy Measure and Choquet Integral 

In this section, we define a fuzzy measure and the Choquet integral and show 
their basic properties. 

Let $S$ be a universal set and $\mathcal{S}$ be a $\sigma$-algebra of $S$, that is, $(S, \mathcal{S})$ is a
measurable space.

Definition 1 [14] Let $(S, \mathcal{S})$ be a measurable space. A fuzzy measure $\mu$ is
a real valued set function, $\mu : \mathcal{S} \to R^+$, with the following properties:

(1) $\mu(\emptyset) = 0$

(2) $\mu(A) \le \mu(B)$ whenever $A \subset B$, $A, B \in \mathcal{S}$.

We say that the triplet $(S, \mathcal{S}, \mu)$ is a fuzzy measure space if $\mu$ is a fuzzy
measure.

$\mathcal{F}(S)$ denotes the class of non-negative measurable functions, that is,

$\mathcal{F}(S) = \{f \mid f : S \to R^+, f : \text{measurable}\}$

Definition 2 [3, 7] Let $(S, \mathcal{S}, \mu)$ be a fuzzy measure space. The Choquet
integral of $f \in \mathcal{F}(S)$ with respect to $\mu$ is defined by

$(C)\int f\,d\mu = \int_0^\infty \mu_f(r)\,dr$,

where $\mu_f(r) = \mu(\{x \mid f(x) \ge r\})$.

Suppose that $S = \{1, 2, \dots, n\}$. The $i$-th order statistic $x^{(i)}$ [15] is a
functional on $[0,1]^n$ which is defined by arranging the components of
$x = (x_1, \dots, x_n) \in [0,1]^n$ in increasing order

$x^{(1)} \le \dots \le x^{(i)} \le \dots \le x^{(n)}$.

Using the $i$-th order statistics, the Choquet integral is written as

$(C)\int x\,d\mu = \sum_{i=1}^{n} (x^{(i)} - x^{(i-1)})\,\mu(\{(i), \dots, (n)\})$,

where we define $x^{(0)} := 0$.
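On a finite state space, the order-statistics form of the Choquet integral, the sum of $(x^{(i)} - x^{(i-1)})\,\mu(\{(i), \dots, (n)\})$ with $x^{(0)} := 0$, can be computed in a few lines. The following is a minimal sketch, not the authors' implementation: the function name `choquet_integral`, the dictionary representation of the integrand and the example measure are all assumptions made for illustration.

```python
def choquet_integral(values, mu):
    """Discrete Choquet integral.

    values: dict mapping each state to a non-negative number x_i.
    mu:     callable taking a frozenset of states and returning its measure.
    """
    order = sorted(values, key=values.get)        # states by increasing value
    total, previous = 0.0, 0.0                    # previous plays the role of x^(i-1)
    for position, state in enumerate(order):
        upper_set = frozenset(order[position:])   # the set {(i), ..., (n)}
        total += (values[state] - previous) * mu(upper_set)
        previous = values[state]
    return total

# Hypothetical additive measure: the integral then reduces to a weighted sum.
weights = {1: 0.2, 2: 0.3, 3: 0.5}

def additive_mu(subset):
    return sum(weights[i] for i in subset)

x = {1: 3.0, 2: 1.0, 3: 2.0}
result = choquet_integral(x, additive_mu)
weighted_sum = sum(weights[i] * x[i] for i in x)
print(result, weighted_sum)
```

Passing the measure as a callable keeps the sketch independent of how a non-additive measure is stored; a lookup table keyed by frozensets would work equally well.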



Definition 3 [4] Let $f, g \in \mathcal{F}(S)$. We say that $f$ and $g$ are comonotonic if

$f(x) < f(x') \Rightarrow g(x) \le g(x')$

for $x, x' \in S$.

Definition 4 Let $I$ be a real-valued functional on $\mathcal{F}(S)$. We say $I$ is
comonotonically additive if and only if $I(f + g) = I(f) + I(g)$ for comonotonic
$f, g \in \mathcal{F}(S)$, and $I$ is monotone if and only if $f \le g \Rightarrow I(f) \le I(g)$ for $f, g \in \mathcal{F}(S)$.

Next we show that a comonotonically additive functional $I$ on $\mathcal{F}(S)$ which
satisfies a less restrictive condition than monotonicity can be represented by the
Choquet integral.

Definition 5 [9] We say that a functional $I$ on $\mathcal{F}(S)$ is comonotonic monotone
if $f \le g$ implies $I(f) \le I(g)$ for comonotonic $f, g \in \mathcal{F}(S)$.

In the following we suppose that the functional $I$ on $\mathcal{F}(S)$ is comonotonically
additive and comonotonic monotone (for short c.a.c.m.). The next theorem is
a less restrictive version of Schmeidler's representation theorem [12].

Theorem 6 $I$ is a c.a.c.m. functional on $\mathcal{F}(S)$ if and only if there exists a fuzzy
measure $\mu$ such that

$I(f) = (C)\int f\,d\mu$

for all $f \in \mathcal{F}(S)$.

3 Decision under Risk and Uncertainty 

In this section we present frames for decision under risk and uncertainty and
paradoxes that classical expected utility theory fails to explain. Next we present
the definitions of the Choquet expected utility and the rank dependent utility.

Let $S$ be a state space and $X$ be a set of outcomes. We assume that outcomes
are monetary. Therefore we may suppose $X \subset R$. By "decision under
uncertainty" we mean situations in which no objective probability is given.
In decision under uncertainty, we consider the set of functions $f$ from $S$ to $X$; we
call such a function $f$ an act. $\mathcal{F}$ denotes the set of acts, that is,

$\mathcal{F} = \{f \mid f : S \to X\}$.

$\preceq$ denotes the weak order on $\mathcal{F}$. We say that the quadruplet $(S, X, \mathcal{F}, \preceq)$ is the
frame for decision under uncertainty.

In contrast with decision under uncertainty, by decision under risk we mean
situations in which there exists an objective probability on $S$. That is, in decision
under risk, we consider the set $\mathcal{P}$ of probabilities on $S$ and the set of functions $f$
from $S$ to $X$, which are called random variables. The set $\mathcal{F}$ of acts is the same as
the set of random variables. $\preceq$ denotes the weak order on $\mathcal{F}$. We say that the
quintuplet $(S, X, \mathcal{P}, \mathcal{F}, \preceq)$ is the frame for decision under risk.

Example 1 (St. Petersburg game) A fair coin is tossed until a head appears. If
the first head occurs at the $n$-th toss, the payoff is $2^n$ Euros.

In this case, $S := \{1, 2, \dots, n, \dots\}$ is the number of times the coin is tossed
until a head appears. $X$ is a set of amounts of money. As the act $f$, if the
number of tosses is $n \in S$, one obtains $2^n \in X$ Euros, that is, $f(n) = 2^n$. The
probability $P$ is defined by $P(n) = 1/2^n$. Suppose that you own a title for one
play of the game. What is the least amount you would sell your title for? The
expected payoff $E_P(f)$ is

$E_P(f) = \sum_{n=1}^{\infty} 2^n \times (1/2^n) = \infty$.

However, most people would sell the title for a relatively small sum.



To explain the paradox, Cramer considered that, in practice, people with
common sense evaluate money in proportion to the utility they can obtain from
it. That is, he considered a utility function $u : R \to R$ and the expected utility

$E_P(u(f)) = \sum_{n=1}^{\infty} u(2^n) \times (1/2^n) < \infty$.

Von Neumann and Morgenstern [.5] proposed the axioms for preferences
represented by expected utility.
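The contrast between the divergent expected payoff and a finite expected utility can be seen from partial sums. The sketch below is illustrative only: the text does not specify the utility function, so the square-root utility $u(x) = \sqrt{x}$ used here is an assumption, as are the function names.

```python
import math

def expected_payoff_partial(n_terms):
    """Partial sum of sum_n 2^n * (1 / 2^n); every term equals 1."""
    return sum(2 ** n * (1 / 2 ** n) for n in range(1, n_terms + 1))

def expected_sqrt_utility_partial(n_terms):
    """Partial sum of sum_n u(2^n) * (1 / 2^n) with the assumed utility u(x) = sqrt(x)."""
    return sum(math.sqrt(2 ** n) / 2 ** n for n in range(1, n_terms + 1))

for n_terms in (10, 50, 200):
    print(n_terms, expected_payoff_partial(n_terms), expected_sqrt_utility_partial(n_terms))
```

The first column of results grows linearly with the number of terms, while the second approaches the geometric-series limit $1 + \sqrt{2}$.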

The next example is a famous paradox that expected utility theory fails to explain.



Example 2 (Allais paradox [1]) Let the state space be $S := \{1, 2, \dots, 100\}$ and
the set of outcomes be $X := \{0, 100, 200\}$. Define random variables $f_1, f_2, f_3$ and $f_4$ by

$f_1(x) := 100$ for all $x \in S$,

$f_2(x) := 200$ if $1 \le x \le 70$, and $f_2(x) := 0$ if $71 \le x \le 100$,

$f_3(x) := 100$ if $1 \le x \le 15$, and $f_3(x) := 0$ if $16 \le x \le 100$,

$f_4(x) := 200$ if $1 \le x \le 10$, and $f_4(x) := 0$ if $11 \le x \le 100$.

$f_1$ means that you always get 100 dollars. $f_2$ means that you will get 200
dollars if you draw a number from 1 to 70 and you get nothing if you draw a number
from 71 to 100. It is reported that most people choose $f_1$, that is, $f_2 \prec f_1$. In the
same way $f_3$ means that you will get 100 dollars if you draw a number from 1 to 15,
and $f_4$ means that you will get 200 dollars if you draw a number from 1 to 10. It is
reported that most people choose $f_4$, that is, $f_3 \prec f_4$. If the probabilities are all
the same, i.e. $P(i) = \frac{1}{100}$ for all $i \in S$, this preference cannot be represented by
expected utility. In fact, suppose that there exists a utility function
$u : X \to R^+$, normalized so that $u(0) = 0$, such that $f \prec g \Leftrightarrow E(u(f)) < E(u(g))$,
where $E(\cdot)$ is the classical expectation. Since $f_2 \prec f_1$, we have $0.7u(200) < u(100)$.
On the other hand, it follows from $f_3 \prec f_4$ that $0.15u(100) < 0.1u(200)$. Therefore we
have $1.05u(100) < 0.7u(200) < u(100)$. This is a contradiction.



The next example is Ellsberg's paradox [6] for decision under uncertainty.

Example 3 (Ellsberg's paradox)

Consider an urn containing red, black and white balls. The number of red balls is
30, and the black and white balls together number 60. The number of black balls is
unknown. $f_R$ means that you will get 100 dollars only if you draw a red ball and
$f_B$ means that you will get 100 dollars only if you draw a black ball. Most people
select $f_R$ because there may be only a few black balls in the urn, that is,
$f_B \prec f_R$. $f_{RW}$ means that you will get 100 dollars if you draw a red or white
ball and $f_{BW}$ means that you will get 100 dollars if you draw a black or white
ball. Most people select $f_{BW}$ because there may be only a few white balls in the
urn, that is, $f_{RW} \prec f_{BW}$. This preference cannot be explained by expected utility
theory. In fact, let the state space be $S := \{R, B, W\}$ and the set of outcomes be
$X := \{0, 100\}$. The acts $f_R$, $f_B$, $f_{RW}$ and $f_{BW}$ are defined by the table below
(payoffs in dollars).

            30        60 (Black and White together)
            Red       Black     White
  f_R       100       0         0
  f_B       0         100       0
  f_RW      100       0         100
  f_BW      0         100       100

Suppose that there exists a subjective probability $P$ such that
$f \prec g \Leftrightarrow E(u(f)) < E(u(g))$. It follows from $f_{RW} \prec f_{BW}$ that
$u(100)P(B) = u(100)(P(BW) - P(W)) > u(100)(P(RW) - P(W)) = u(100)P(R)$.
This contradicts $f_B \prec f_R$.

To solve those paradoxes, models using the Choquet integral with respect to
a fuzzy measure have been proposed; that is, the Choquet expected utility model
for decision under uncertainty and the rank dependent utility model for decision
under risk. First we define the Choquet expected utility model (CEU) for decision
under uncertainty, which was introduced by Schmeidler [13].

Definition 7 Consider the frame of decision under uncertainty. The Choquet
expected utility model stipulates that the decision maker ranks acts $f$ with the help
of a utility function $u : R \to R$, which is continuous and strictly increasing. The
ranking $C_{u,\mu}$ of acts $f$ is performed through

$C_{u,\mu}(f) := (C)\int u(f)\,d\mu$,

that is, $f \preceq g \Leftrightarrow C_{u,\mu}(f) \le C_{u,\mu}(g)$, where $\mu$ is a fuzzy measure (capacity).

Next we define the rank dependent expected utility model (RDEU) by
Quiggin [10].

Definition 8 Consider the frame of decision under risk. A decision maker
behaves in accordance with the rank dependent expected utility model if the decision
maker's preferences $\prec$ are characterized by two functions $u$ and $w$: a continuous
and strictly increasing function $u : R \to R$ and a probability distorting
function $w$ such that $f \succ g \Leftrightarrow J(f) > J(g)$, where $J(h) := (C)\int u(h)\,d(w \circ P)$ for
a random variable $h$.

In Section 5 we will present the solution of the paradoxes using the simplified
CEU and RDEU models.

4 Choquet-Stieltjes Integral 

In this section we define the Choquet-Stieltjes integral and show that the
Choquet-Stieltjes integral is represented by a Choquet integral. Applying this
theorem, we show that the CEU and the RDEU are the same as their simplified
versions by Chateauneuf [2].

Definition 9 Let $(S, \mathcal{S}, \mu)$ be a fuzzy measure space and $\varphi : R^+ \to R^+$ be a
non-decreasing real valued function. Then we can define the Lebesgue-Stieltjes measure
$\nu_\varphi$ [11] on the real line by

$\nu_\varphi([a, b]) := \varphi(b + 0) - \varphi(a - 0)$,
$\nu_\varphi((a, b)) := \varphi(b - 0) - \varphi(a + 0)$.

We define the Choquet-Stieltjes integral $CS_{\mu,\varphi}(f)$ with respect to $\mu$ by

$CS_{\mu,\varphi}(f) := \int_0^\infty \mu_f(r)\,d\nu_\varphi(r)$,

where $\mu_f(r) = \mu(\{x \mid f(x) \ge r\})$.

If the space $S = \{1, 2, \dots, n\}$, using the $i$-th order statistics, the
Choquet-Stieltjes integral is written as

$CS_{\mu,\varphi}(f) = \sum_{i=1}^{n} (\varphi(f^{(i)}) - \varphi(f^{(i-1)}))\,\mu(\{(i), \dots, (n)\})$.

Suppose that $s \in \mathcal{F}(S)$ is a simple function, that is,

$s = \sum_{i=1}^{n} (r_i - r_{i-1})\,1_{A_i}$,

$0 = r_0 < r_1 < \dots < r_n$ and $A_1 \supsetneq A_2 \supsetneq \dots \supsetneq A_n$. The Choquet-Stieltjes integral
of $s$ is written as

$CS_{\mu,\varphi}(s) = \sum_{i=1}^{n} (\varphi(r_i) - \varphi(r_{i-1}))\,\mu(A_i)$.

Let $f$ be a non-negative measurable function; then there exists a sequence $\{s_n\}$ of
simple functions such that $s_n \uparrow f$. Then we have

$CS_{\mu,\varphi}(f) = \lim_{n \to \infty} CS_{\mu,\varphi}(s_n)$.

Since the Choquet-Stieltjes integral is comonotonically additive and
comonotonically monotone, applying the representation theorem (Theorem 6), we have the
next theorem.

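Theorem 10 Let $(S, \mathcal{S}, \mu)$ be a fuzzy measure space and $\varphi : R^+ \to R^+$ be a
non-decreasing function. There exists a fuzzy measure $\nu_{\mu,\varphi}$ such that

$CS_{\mu,\varphi}(f) = (C)\int f\,d\nu_{\mu,\varphi}$,

that is, the Choquet-Stieltjes integral can be represented by a Choquet integral.

Proof. Let non-negative measurable functions $f$ and $g$ be comonotonic. Then
there exist sequences of simple functions $\{s_n\}$ and $\{t_n\}$ such that $s_n \uparrow f$, $t_n \uparrow g$
and $s_n$ and $t_n$ are comonotonic. Since $f$ and $g$ are comonotonic, for every $a, b > 0$,
$\{x \mid f(x) \ge a\} \subset \{x \mid g(x) \ge b\}$ or $\{x \mid f(x) \ge a\} \supset \{x \mid g(x) \ge b\}$. Therefore we
may write

$s_n = \sum_{i=1}^{n} (r_i - r_{i-1})\,1_{C_{i,n}}$, $t_n = \sum_{i=1}^{n} (r'_i - r'_{i-1})\,1_{C_{i,n}}$,

$0 = r_0 \le r_1 \le \dots \le r_n$, $0 = r'_0 \le r'_1 \le \dots \le r'_n$, $C_{1,n} \supset C_{2,n} \supset \dots \supset C_{n,n}$.

Therefore we have

$CS_{\mu,\varphi}(s_n) + CS_{\mu,\varphi}(t_n) = \sum_{i=1}^{n} (\varphi(r_i) - \varphi(r_{i-1}))\,\mu(C_{i,n}) + \sum_{i=1}^{n} (\varphi(r'_i) - \varphi(r'_{i-1}))\,\mu(C_{i,n})$

$= \sum_{i=1}^{n} ((\varphi(r_i) - \varphi(r_{i-1})) + (\varphi(r'_i) - \varphi(r'_{i-1})))\,\mu(C_{i,n})$.

Since $s_n + t_n \uparrow f + g$ we have

$CS_{\mu,\varphi}(f + g) = \lim_{n \to \infty} CS_{\mu,\varphi}(s_n + t_n)$

$= \lim_{n \to \infty} (CS_{\mu,\varphi}(s_n) + CS_{\mu,\varphi}(t_n))$

$= CS_{\mu,\varphi}(f) + CS_{\mu,\varphi}(g)$.

Therefore $CS_{\mu,\varphi}$ is comonotonically additive. Suppose that $f \le g$. We may
suppose that $s_n \le t_n$. Since $\varphi$ is monotone,

$CS_{\mu,\varphi}(s_n) = \sum_{i=1}^{n} (\varphi(r_i) - \varphi(r_{i-1}))\,\mu(C_{i,n})$

$= \sum_{i=1}^{n} \varphi(r_i)(\mu(C_{i,n}) - \mu(C_{i+1,n}))$

$\le \sum_{i=1}^{n} \varphi(r'_i)(\mu(C_{i,n}) - \mu(C_{i+1,n})) = CS_{\mu,\varphi}(t_n)$.

Therefore we have

$CS_{\mu,\varphi}(f) = \lim_{n \to \infty} CS_{\mu,\varphi}(s_n) \le \lim_{n \to \infty} CS_{\mu,\varphi}(t_n) = CS_{\mu,\varphi}(g)$,

that is, $CS_{\mu,\varphi}$ is comonotonically monotone. Then it follows from the
representation theorem (Theorem 6) that there exists a fuzzy measure $\nu_{\mu,\varphi}$ such that

$CS_{\mu,\varphi}(f) = (C)\int f\,d\nu_{\mu,\varphi}$.

□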

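Suppose that $\varphi$ is strictly increasing.

Since $\{x \mid f(x) \ge \varphi^{-1}(a)\} = \{x \mid \varphi(f(x)) \ge a\}$, we have the next corollary.

Corollary 11 Let $(S, \mathcal{S}, \mu)$ be a fuzzy measure space and $\varphi : R^+ \to R^+$ be
a strictly increasing function. Then there exists a fuzzy measure $\nu_{\mu,\varphi}$ such that

$(C)\int \varphi(f)\,d\mu = (C)\int f\,d\nu_{\mu,\varphi}$.

Proof. Let $f \in \mathcal{F}(S)$. Denote a real valued function $\mu_f : R^+ \to R^+$ by
$\mu_f(a) = \mu(\{x \mid f(x) \ge a\})$. Then we have

$\mu(\{x \mid \varphi(f(x)) \ge a\}) = \mu(\{x \mid f(x) \ge \varphi^{-1}(a)\}) = \mu_f(\varphi^{-1}(a))$

for $a \ge 0$.

Since $\mu_f(\varphi^{-1}(a))$ is non-increasing in $a$, there exists $\{a_{k,n}\} \subset R$ such that

$\int_0^\infty \mu_f(\varphi^{-1}(a))\,da = \lim_{n \to \infty} \sum_{k=1}^{n} \mu_f(\varphi^{-1}(a_{k,n}))(a_{k,n} - a_{k-1,n})$.

Let $t_{k,n} := \varphi^{-1}(a_{k,n})$. We have

$\sum_{k=1}^{n} \mu_f(\varphi^{-1}(a_{k,n}))(a_{k,n} - a_{k-1,n}) = \sum_{k=1}^{n} \mu_f(t_{k,n})(\varphi(t_{k,n}) - \varphi(t_{k-1,n}))$.

Therefore we have

$(C)\int \varphi(f)\,d\mu = \int_0^\infty \mu_f(\varphi^{-1}(a))\,da$

$= \lim_{n \to \infty} \sum_{k=1}^{n} \mu_f(\varphi^{-1}(a_{k,n}))(a_{k,n} - a_{k-1,n})$

$= \lim_{n \to \infty} \sum_{k=1}^{n} \mu_f(t_{k,n})(\varphi(t_{k,n}) - \varphi(t_{k-1,n}))$

$= \int_0^\infty \mu_f(t)\,d\nu_\varphi(t) = CS_{\mu,\varphi}(f)$,

where $\nu_\varphi$ is the Lebesgue-Stieltjes measure generated by $\varphi$. Then applying
Theorem 10, there exists a fuzzy measure $\nu_{\mu,\varphi}$ such that

$(C)\int \varphi(f)\,d\mu = (C)\int f\,d\nu_{\mu,\varphi}$.

□

The corollary above means that the CEU (resp. the RDEU) is the same as its
simplified version.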
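On a finite state space, the first step in the proof of Corollary 11, namely that the Choquet integral of $\varphi(f)$ coincides with the Choquet-Stieltjes integral of $f$, can be checked numerically. The following is only a sketch under stated assumptions: the function name `choquet_stieltjes`, the non-additive measure $\mu(A) = (|A|/3)^2$ and the choice $\varphi(t) = t^2$ (strictly increasing on $R^+$ with $\varphi(0) = 0$) are hypothetical. It uses the discrete form with $f^{(0)} := 0$.

```python
def choquet_stieltjes(values, mu, phi):
    """Discrete Choquet-Stieltjes integral:
    sum_i (phi(f^(i)) - phi(f^(i-1))) * mu({(i), ..., (n)}), with f^(0) := 0."""
    order = sorted(values, key=values.get)        # states by increasing value
    total, previous = 0.0, 0.0
    for position, state in enumerate(order):
        upper_set = frozenset(order[position:])
        total += (phi(values[state]) - phi(previous)) * mu(upper_set)
        previous = values[state]
    return total

def mu(subset):
    # Hypothetical non-additive fuzzy measure on S = {1, 2, 3}.
    return (len(subset) / 3) ** 2

def phi(t):
    # Assumed strictly increasing function with phi(0) = 0.
    return t ** 2

f = {1: 3.0, 2: 1.0, 3: 2.0}

cs_value = choquet_stieltjes(f, mu, phi)
# With the identity as phi, the same routine is the ordinary Choquet integral.
choquet_of_phi_f = choquet_stieltjes({s: phi(v) for s, v in f.items()}, mu, lambda t: t)
print(cs_value, choquet_of_phi_f)
```

The agreement of the two numbers depends on $\varphi$ preserving the ordering of the values and on $\varphi(0) = 0$; it illustrates only this one step, not the representation by a single measure.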

5 Solution of Paradoxes 

Using the simplified version, we can explain the St. Petersburg game as well as
both Allais' and Ellsberg's paradoxes.

(1) (St. Petersburg's game)

Let $w : [0,1] \to [0,1]$ be a probability distortion function such that

$w(p_1 + \dots + p_n) - w(p_1 + \dots + p_{n-1}) < (p_n)^a$

for ($a > 1$).

Then $w \circ P$ is a fuzzy measure (distorted probability). We have

$C_\mu(2^n) = \sum_{n=1}^{\infty} 2^n (w(p_1 + \dots + p_n) - w(p_1 + \dots + p_{n-1})) < \infty$.

(2) (Allais paradox)

We can define the probability distortion function $w : [0,1] \to [0,1]$ such that
$w(0.1) = 0.08$, $w(0.15) = 0.1$, $w(0.7) = 0.45$ and $w(1) = 1$. Then it follows
from the Choquet integral with respect to the fuzzy measure $\mu := w \circ P$
that

$C_\mu(f_1) = 100 > C_\mu(f_2) = 90$ and $C_\mu(f_3) = 10 < C_\mu(f_4) = 16$.

(3) (Ellsberg's paradox)

We may define the fuzzy measure $\mu$ such that

$\mu(\{R\}) := 1/3$, $\mu(\{B\}) = \mu(\{W\}) := 2/9$,

$\mu(\{R, W\}) := 5/9$, $\mu(\{B, W\}) = \mu(\{R, B\}) := 2/3$

and $\mu(\{R, B, W\}) = 1$.

Then we have the Choquet integral of $f_*$ by the table below.

  f_*            f_R     f_B     f_RW    f_BW
  C_\mu(f_*)     1/3     2/9     5/9     2/3

The above table says that $C_\mu(f_B) < C_\mu(f_R)$ and $C_\mu(f_{RW}) < C_\mu(f_{BW})$.
Then there is no contradiction.
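The numbers in items (2) and (3) can be reproduced with exact rational arithmetic. The following is a minimal sketch, assuming the distortion values and the fuzzy measure quoted in the text; the helper name `two_outcome_choquet` and the dictionary layout are hypothetical. For an act that pays a fixed amount on an event and 0 otherwise, the Choquet integral reduces to the payoff times the measure of the winning event. As in the table of item (3), the Ellsberg payoff is normalized to 1.

```python
from fractions import Fraction

# (2) Allais: distortion values quoted in the text; P is uniform on 100 states.
w = {
    Fraction(1, 10): Fraction(8, 100),
    Fraction(15, 100): Fraction(1, 10),
    Fraction(7, 10): Fraction(45, 100),
    Fraction(1): Fraction(1),
}

def two_outcome_choquet(payoff, win_probability):
    """Choquet integral w.r.t. w o P of an act paying `payoff` with the given
    probability and 0 otherwise: payoff * w(probability)."""
    return payoff * w[win_probability]

allais = {
    "f1": two_outcome_choquet(100, Fraction(1)),
    "f2": two_outcome_choquet(200, Fraction(7, 10)),
    "f3": two_outcome_choquet(100, Fraction(15, 100)),
    "f4": two_outcome_choquet(200, Fraction(1, 10)),
}

# (3) Ellsberg: fuzzy measure from the text on subsets of {R, B, W}.
mu = {
    frozenset("R"): Fraction(1, 3),
    frozenset("B"): Fraction(2, 9),
    frozenset("W"): Fraction(2, 9),
    frozenset("RW"): Fraction(5, 9),
    frozenset("BW"): Fraction(2, 3),
    frozenset("RB"): Fraction(2, 3),
    frozenset("RBW"): Fraction(1),
}

# Each act pays on its winning event only, so its Choquet integral
# (payoff normalized to 1) is the measure of that event.
ellsberg = {name: mu[frozenset(event)]
            for name, event in {"f_R": "R", "f_B": "B", "f_RW": "RW", "f_BW": "BW"}.items()}

print(allais)
print(ellsberg)
```

Both observed preference patterns are reproduced: the sure act beats $f_2$ while $f_4$ beats $f_3$, and $f_R$ beats $f_B$ while $f_{BW}$ beats $f_{RW}$.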

6 Conclusion 

We defined the Choquet-Stieltjes integral and showed that the Choquet-Stieltjes
integral is represented by a Choquet integral. As an application of this theorem,
we showed that the Choquet expected utility model (resp. rank dependent utility
model) is the same as its simplified version. The simplified version is sufficient to
explain the paradoxes of expected utility theory.

In the simplified model, where the preference on the acts is the input and the
inequality of Choquet integrals is the output, we only need to identify
a fuzzy measure, whereas the former CEU or rank dependent model needs to identify
both a fuzzy measure and a utility function. In this paper we show that the simplified
version has the same descriptive ability as the former models.

Abstract. The reaching of consensus in group decision-making (GDM)
problems is a common task in group decision processes. In this contri- 
bution, we consider GDM with linguistic information. Different experts 
may have different levels of knowledge about a problem and, therefore, 
different linguistic term sets (multi-granular linguistic information) can 
be used to express their opinions. 

The aim of this paper is to present different ways of measuring con- 
sensus in order to assess the level of agreement between the experts in 
multi-granular linguistic GDM problems. To make the measurement of 
consensus in multi-granular GDM problems possible and easier, it is nec- 
essary to unify the information assessed in different linguistic term sets 
into a single one. This is done using fuzzy sets defined on a basic linguistic 
term set (BLTS). Once the information is unified, two types of
measurement of consensus are carried out: consensus degrees and proximity
measures. The first type assesses the agreement among all the experts’ 
opinions, while the second type is used to find out how far the individ- 
ual opinions are from the group opinion. The proximity measures can be 
used by a moderator in the consensus process to suggest to the experts 
the necessary changes to their opinions in order to be able to obtain the 
highest degree of consensus possible. Both types of measurements are 
computed in the three different levels of representation of information: 
pair of alternatives, alternatives and experts. 

Keywords: Consensus, multi-granular linguistic information, group
decision-making, linguistic modelling, fuzzy preference relation

1 Introduction

A group decision-making (GDM) problem may be defined as a decision situation 
where: i) there exist two or more experts that are characterized by their own 
perceptions, attitudes, motivations and knowledge, ii) there exists a problem to 
be solved, and iii) they try to achieve a common solution. 

Fuzzy sets theory has proven successful for handling fuzziness and modelling 
qualitative information [6, 7, 13]. In this theory, the qualitative aspects of the 
problem are represented by means of “linguistic variables” [14], i.e., variables 
whose values are not numbers but words or sentences in a natural or artificial 
language. 

An important parameter to determine in a linguistic context is the “granu- 
larity of uncertainty”, i.e., the cardinality of the linguistic term set that will be 
used to express the information. Because experts may come from different re- 
search areas, and thus have different levels of knowledge, it is natural to assume 
that linguistic term sets of different cardinality and/or semantics could be used 
to express their opinions on the set of alternatives. In these cases, we say that 
we are working in a multi-granular linguistic context [4, 12], and we will call this 
type of problem a multi-granular linguistic GDM problem. 

In GDM problems there are two processes to carry out before obtaining 
a final solution [3, 5, 8, 9]: the consensus process and the selection process (see 
Figure 1). The first one refers to how to obtain the maximum degree of consensus 
or agreement between the set of experts on the solution set of alternatives. 
Normally this process is guided by the figure of a moderator [5, 9]. The second one 
consists of how to obtain the solution set of alternatives from the opinions on the 
alternatives given by the experts. Clearly, it is preferable that the set of experts
reach a high degree of consensus before applying the selection process. In [4], 
the selection process for multi-granular linguistic GDM problem was studied. 
Therefore, in this paper, we focus on the consensus process, and in particular we 
address the problem of how to measure the consensus in such a type of GDM 
problem. 

Traditionally, the consensus process is defined as a dynamic and iterative 
group discussion process, coordinated by a moderator, who helps the experts to 
bring their opinions closer. In each step of this process, the moderator, by means 
of a consensus measure, knows the actual level of consensus between the experts 
which establishes the distance to the ideal state of consensus. If the consensus 
level is not acceptable, i.e., if it is lower than a specified threshold, then the 
moderator would urge the experts to discuss their opinions further in an effort 
to bring them closer [2, 15]. 

The aim of this paper is to present two different measurements to assess the 
level of agreement between the experts in multi-granular GDM problems. These 
measurements can be classified into two types: 

a) Consensus degrees to identify the level of agreement among all experts and
to decide when the consensus process should stop.

b) Proximity measures to evaluate the distance between the experts' individual
opinions and the group or collective opinion. The proximity values are
used by the moderator to guide the direction of the changes in the experts'
opinions in order to increase the degree of consensus.

Fig. 1. Resolution process of a group decision-making problem

For each one of these measurements, it is interesting to know not only the
global agreement or proximity amongst experts but also the partial degrees on
a particular alternative or pair of alternatives. To do this, both types of
measurements are carried out at three different levels of representation of information:

Level 1 or pair of alternatives level. At this level both the agreement amongst 
all the experts and the distance between each individual expert’s opinion 
and the group opinion on each pair of alternatives are calculated. 

Level 2 or alternatives level. At this level, the consensus degree and the prox- 
imity on each alternative are obtained. 

Level 3 or experts’ level. The global consensus degree amongst all the experts 
and the distance between each individual expert’s opinion and the group 
opinion on all the alternatives are calculated. 

This means that in total six measurements are obtained, a consensus degree 
and a proximity measure at each one of the three levels. To make the computation 
of these six measurements in multi-granular GDM problems possible and easier, 
it is necessary to unify the different linguistic term sets into a single one. To do 
so, fuzzy sets on a basic linguistic term set (BLTS) are used, and the appropriate 
transformation functions are defined. 

The rest of the paper is set out as follows. The multi-granular linguistic GDM
problem is described in Section 2. The different consensus and proximity
measures are presented in Section 3. Finally, in Section 4 we draw our conclusions.

Fig. 2. A set of seven terms with their semantics

2 Multi-granular Linguistic GDM Problems 

We focus on GDM problems in which two or more experts express their 
preferences about a set of alternatives by means of linguistic labels. A classical way 
to express preferences in GDM problems is by means of preference relations [3]. 
A GDM problem based on linguistic preference relations may be defined as 
follows: there are $X = \{x_1, x_2, \ldots, x_n\}$ $(n \geq 2)$, a finite set of alternatives, and 
a group of experts, $E = \{e_1, e_2, \ldots, e_m\}$ $(m \geq 2)$; each expert $e_i$ provides his/her 
preferences on $X$ by means of a linguistic preference relation, $\mu_{P_{e_i}} : X \times X \to S$, 
where $S = \{s_0, s_1, \ldots, s_g\}$ is a linguistic term set characterized by its cardinality 
or granularity, $\#(S) = g + 1$. Additionally, the following properties are assumed: 

1. The set $S$ is ordered: $s_i \geq s_j$ if $i \geq j$. 

2. There is the negation operator: $Neg(s_i) = s_j$ such that $j = g - i$. 

3. There is the min operator: $Min(s_i, s_j) = s_i$ if $s_i \leq s_j$. 

4. There is the max operator: $Max(s_i, s_j) = s_i$ if $s_i \geq s_j$. 

The semantics of the terms is represented by fuzzy numbers defined on the 
[0,1] interval. One way to characterize a fuzzy number is by using a representation 
based on parameters of its membership function [1]. For example, the following 
semantics, represented in Figure 2, can be assigned to a set of seven terms via 
triangular fuzzy numbers: 

P = Perfect = (0.83, 1, 1) 
VH = Very_High = (0.67, 0.83, 1) 
H = High = (0.5, 0.67, 0.83) 
M = Medium = (0.33, 0.5, 0.67) 
L = Low = (0.17, 0.33, 0.5) 
VL = Very_Low = (0, 0.17, 0.33) 
N = None = (0, 0, 0.17) 
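
As a minimal sketch of how these semantics can be evaluated, the following Python fragment builds a membership function for each triangular fuzzy number $(a, b, c)$ listed above. The names `triangular`, `SEVEN_TERMS` and `memberships` are illustrative assumptions and do not come from the paper.

```python
def triangular(a, b, c):
    """Return the membership function of the triangular fuzzy number (a, b, c)."""
    def mu(y):
        if y == b:
            return 1.0                    # peak; also covers a == b or b == c
        if a < y < b:
            return (y - a) / (b - a)      # rising side
        if b < y < c:
            return (c - y) / (c - b)      # falling side
        return 0.0
    return mu


# The seven-term set of Figure 2, as (a, b, c) parameter triples.
SEVEN_TERMS = {
    "N": (0.0, 0.0, 0.17), "VL": (0.0, 0.17, 0.33), "L": (0.17, 0.33, 0.5),
    "M": (0.33, 0.5, 0.67), "H": (0.5, 0.67, 0.83), "VH": (0.67, 0.83, 1.0),
    "P": (0.83, 1.0, 1.0),
}
memberships = {label: triangular(*abc) for label, abc in SEVEN_TERMS.items()}
```

For instance, `memberships["M"](0.5)` evaluates the label Medium at its peak, and values outside the support of a label evaluate to zero.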

The ideal situation in GDM problems in a linguistic context would be one 
where all the experts use the same linguistic term set S to provide their opinions. 
However, in some cases, experts may belong to distinct research areas and have 
different levels of knowledge about the alternatives. A consequence of this is 
that the expression of preferences will be based on linguistic term sets with 
different granularity, which means that adequate tools to manage and model 
multi-granular linguistic information become essential [4, 7, 12]. 

In this paper, we deal with multi-granular linguistic GDM problems, i.e., 
GDM problems where each expert $e_i$ may express his/her opinions on the set 
of alternatives using different linguistic term sets with different cardinality 
$S_i = \{s_0^i, \ldots, s_p^i\}$, by means of a linguistic preference relation $P_{e_i} = (p_i^{lk})$, where 
$p_i^{lk} \in S_i$ represents the preference of alternative $x_l$ over alternative $x_k$ for that 
expert. 



3 The Measurement of Consensus in Multi-granular Linguistic Context 

The measurement of consensus in GDM problems is carried out using two 
different measures: consensus degrees and proximity measures. However, as we 
assume a multi-granular linguistic context, the first step must be to obtain a uniform 
representation of the preferences, i.e., experts’ preferences must be transformed 
(using a transformation function) into a single domain or linguistic term set that 
we call the basic linguistic term set (BLTS) and denote by $S_T$. 

The measurement of consensus in multi-granular linguistic GDM problems 
is therefore carried out in three steps: (i) making the linguistic information 
uniform, (ii) computation of consensus degrees and (iii) computation of proximity 
measures. 

3.1 Making the Linguistic Information Uniform 

In this step, a basic linguistic term set (BLTS), $S_T$, has to be selected. To do 
this it seems reasonable to impose a granularity high enough to maintain the 
uncertainty degrees associated with each one of the possible domains to be unified. 
This means that the granularity of the BLTS has to be as high as possible. 
Therefore, in a general multi-granular linguistic context, to select $S_T$ we proceed 
as follows: 

1. If there is just one linguistic term set, from the set of different domains 
to be unified, with maximum granularity, then we choose that one as the 
BLTS, $S_T$. 

2. If there are two or more linguistic term sets with maximum granularity, then 
the choice of $S_T$ will depend on the semantics associated with them: 

(a) If all of them have the same semantics (with different labels), then any 
one of them can be selected as $S_T$. 

(b) If two or more of them have different semantics, then $S_T$ is defined as 
a generic linguistic term set with a number of terms greater than the 
number of terms a person is able to discriminate, which is normally 11 or 
13 [11], although we can find cases of BLTS with 15 terms symmetrically 
distributed [4, 10]. 
Once $S_T$ has been selected, the following multi-granular transformation 
function is applied to transform every linguistic value into a fuzzy set defined on $S_T$: 

Definition 1 [4] If $A = \{l_0, \ldots, l_p\}$ and $S_T = \{c_0, \ldots, c_g\}$ are two linguistic 
term sets, with $g \geq p$, then a multi-granular transformation function, $\tau_{A S_T}$, is 
defined as 

$$\tau_{A S_T} : A \to F(S_T)$$

$$\tau_{A S_T}(l_i) = \{(c_h, \alpha_h^i) \,/\, h \in \{0, \ldots, g\}\}, \quad \forall l_i \in A$$

$$\alpha_h^i = \max_y \min\{\mu_{l_i}(y), \mu_{c_h}(y)\}$$

where $F(S_T)$ is the set of fuzzy sets defined on $S_T$, and $\mu_{l_i}(y)$ and $\mu_{c_h}(y)$ are 
the membership functions of the fuzzy sets associated with the linguistic terms $l_i$ 
and $c_h$, respectively. 
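
The max-min composition in Definition 1 can be approximated numerically. The sketch below assumes triangular semantics on $[0, 1]$ and evaluates the maximum over a regular grid of points $y$, so the resulting degrees are approximations. The helper names (`triangular`, `uniform_term_set`, `transform`) and the evenly spaced term sets are assumptions made for illustration only.

```python
def triangular(a, b, c):
    """Membership function of the triangular fuzzy number (a, b, c)."""
    def mu(y):
        if y == b:
            return 1.0
        if a < y < b:
            return (y - a) / (b - a)
        if b < y < c:
            return (c - y) / (c - b)
        return 0.0
    return mu


def uniform_term_set(n_terms):
    """Hypothetical helper: n_terms evenly spaced triangular labels on [0, 1]."""
    step = 1.0 / (n_terms - 1)
    return [(max(0.0, (i - 1) * step), i * step, min(1.0, (i + 1) * step))
            for i in range(n_terms)]


def transform(label, blts, grid_size=1001):
    """Approximate tau(label): alpha_h = max_y min(mu_label(y), mu_c_h(y))."""
    mu_label = triangular(*label)
    ys = [k / (grid_size - 1) for k in range(grid_size)]
    alphas = []
    for c_h in blts:
        mu_c = triangular(*c_h)
        alphas.append(max(min(mu_label(y), mu_c(y)) for y in ys))
    return alphas


blts = uniform_term_set(5)                # S_T with g + 1 = 5 terms
medium_of_three = uniform_term_set(3)[1]  # the label (0, 0.5, 1) of a 3-term set
alphas = transform(medium_of_three, blts)
```

Here the middle label of a three-term set is mapped to a vector of five membership degrees on the BLTS, with the largest degree at the middle term of the BLTS.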

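The composition of the linguistic preference relations provided by the experts 
with the multi-granular transformation functions $\{\tau_{S_i S_T}, \forall i\}$ results in 
a unification of the preferences for the whole group of experts. In particular, the 
linguistic preference $p_i^{lk}$ will be transformed into the fuzzy set, defined on 
$S_T = \{c_0, \ldots, c_g\}$, 

$$\tau_{S_i S_T}(p_i^{lk}) = \{(c_h, \alpha_{ih}^{lk}) \,/\, h = 0, \ldots, g\}$$

$$\alpha_{ih}^{lk} = \max_y \min\{\mu_{p_i^{lk}}(y), \mu_{c_h}(y)\}$$

We will continue to denote $\tau_{S_i S_T}(p_i^{lk})$ by $p_i^{lk}$ and will use only the membership 
degrees to denote the uniformed linguistic preference relation: 

$$P_{e_i} = \begin{pmatrix} p_i^{11} = (\alpha_{i0}^{11}, \ldots, \alpha_{ig}^{11}) & \cdots & p_i^{1n} = (\alpha_{i0}^{1n}, \ldots, \alpha_{ig}^{1n}) \\ \vdots & \ddots & \vdots \\ p_i^{n1} = (\alpha_{i0}^{n1}, \ldots, \alpha_{ig}^{n1}) & \cdots & p_i^{nn} = (\alpha_{i0}^{nn}, \ldots, \alpha_{ig}^{nn}) \end{pmatrix}$$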



3.2 Computation of Consensus Degrees 

In GDM problems, each consensus parameter requires the use of a similarity 
function to obtain the level of agreement among all the experts. Several sim- 
ilarity functions have been proposed to measure how far each expert is from 
the remaining ones, including the Euclidean distance, the cosine and sine of the 
angle between vectors, etc [2, 15]. 

Initially, we used these traditional distance functions to measure the 
proximity between the linguistic preferences given by experts $e_i$, $e_j$, by comparing 
the membership degree vectors associated with them. However, after checking 
the results of some trials, we discovered cases in which unexpected results were 
obtained, as shown in the following example, which implied that these 
functions were not suitable for our objectives. 
Example 1 Let $p_1^{12} = (1, 0, 0, 0, 0, 0)$, $p_2^{12} = (0, 0, 0, 1, 0, 0)$ and 
$p_3^{12} = (0, 0, 0, 0, 0, 1)$ be three experts’ assessments on the pair of alternatives 
$(x_1, x_2)$. The following values are obtained using the Euclidean distance: 

$$d(p_1^{12}, p_2^{12}) = \sqrt{\sum_{h=0}^{5} (\alpha_{1h}^{12} - \alpha_{2h}^{12})^2} = \sqrt{2}$$

$$d(p_1^{12}, p_3^{12}) = \sqrt{\sum_{h=0}^{5} (\alpha_{1h}^{12} - \alpha_{3h}^{12})^2} = \sqrt{2}$$

With the Euclidean distance, both preference values $p_2^{12}$ and $p_3^{12}$ are at the 
same distance from preference $p_1^{12}$, although it is clear that the second 
one is further from $p_1^{12}$ than the first one. The problem in this case is the 
way the information of these fuzzy sets is interpreted: as a vector of membership 
values, without taking into account their positions in it. To take into 
account both the values and their positions, a different similarity function, able to 
represent the distribution of the information in the fuzzy set $p_i^{lk}$, is necessary. 
The use of the central value of the fuzzy set, $cv_i^{lk}$, is suggested: 

$$cv_i^{lk} = \frac{\sum_{h=0}^{g} index(s_h^i) \cdot \alpha_{ih}^{lk}}{\sum_{h=0}^{g} \alpha_{ih}^{lk}}, \quad index(s_h^i) = h \qquad (1)$$

This value represents the central position or centre of gravity of the information 
contained in the fuzzy set $p_i^{lk} = (\alpha_{i0}^{lk}, \ldots, \alpha_{ig}^{lk})$. The range of the central value 
function is the closed interval $[0, g]$. 



Example 2 The application of (1) to the assessments of Example 1 gives the 
following central values: 

$cv_1^{12} = 0$, $cv_2^{12} = 3$, $cv_3^{12} = 5$. 

For other experts’ assessments such as $p_1^{12} = (0.3, 0.8, 0.6, 0, 0, 0)$, 
$p_2^{12} = (0, 0.3, 0.8, 0.6, 0, 0)$ and $p_3^{12} = (0, 0, 0, 0.3, 0.8, 0.6)$, the central values are: 

$cv_1^{12} = 1.18$, $cv_2^{12} = 2.18$ and $cv_3^{12} = 4.18$. 

As expected, when the information (membership values) moves from the left part 
of the fuzzy set to the right part, the central value increases. 
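
The contrast between the Euclidean distance and the central value can be checked with a short Python sketch. The function name `central_value` is an assumption; the computation is the index-weighted mean of the membership degrees, as in (1).

```python
import math


def central_value(p):
    """Central value (1): index-weighted mean of the membership degrees."""
    total = sum(p)
    if total == 0:
        raise ValueError("empty fuzzy set: all membership degrees are zero")
    return sum(h * alpha for h, alpha in enumerate(p)) / total


p1 = (1, 0, 0, 0, 0, 0)
p2 = (0, 0, 0, 1, 0, 0)
p3 = (0, 0, 0, 0, 0, 1)

# The Euclidean distance cannot tell p2 and p3 apart when seen from p1 ...
same_distance = math.isclose(math.dist(p1, p2), math.dist(p1, p3))
# ... whereas the central values are ordered by position in the term set.
central_values = [central_value(p) for p in (p1, p2, p3)]
```

Rounding `central_value((0.3, 0.8, 0.6, 0, 0, 0))` to two decimals reproduces the value 1.18 quoted in Example 2.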



The value $|cv_i^{lk} - cv_j^{lk}|$ can be used as a measure of distance between 
the preference values $p_i^{lk}$ and $p_j^{lk}$, and, therefore, a measure of similarity or 
proximity between these two preference values, measured in the unit interval 
[0, 1], is defined as: 

$$s(p_i^{lk}, p_j^{lk}) = 1 - \left| \frac{cv_i^{lk} - cv_j^{lk}}{g} \right| \qquad (2)$$

Clearly, the closer $s(p_i^{lk}, p_j^{lk})$ is to 1 the more similar $p_i^{lk}$ and $p_j^{lk}$ are, while the 
closer $s(p_i^{lk}, p_j^{lk})$ is to 0 the more distant $p_i^{lk}$ and $p_j^{lk}$ are. 
Example 3 The values of similarity between the assessments of Example 1 are: 
$s(p_1^{12}, p_2^{12}) = 0.4$, $s(p_1^{12}, p_3^{12}) = 0$. 
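
A minimal Python sketch of the similarity function (2), applied to the assessments of Example 1, is given below. It assumes that both fuzzy sets are defined on the same BLTS, so that $g$ is the vector length minus one; the function names are illustrative.

```python
def central_value(p):
    """Central value (1) of a fuzzy set given as a vector of membership degrees."""
    return sum(h * alpha for h, alpha in enumerate(p)) / sum(p)


def similarity(p, q):
    """Similarity (2): 1 - |cv(p) - cv(q)| / g, with g + 1 terms in the BLTS."""
    g = len(p) - 1
    return 1 - abs(central_value(p) - central_value(q)) / g


p1 = (1, 0, 0, 0, 0, 0)
p2 = (0, 0, 0, 1, 0, 0)
p3 = (0, 0, 0, 0, 0, 1)
s12 = similarity(p1, p2)
s13 = similarity(p1, p3)
```

Unlike the Euclidean distance, this measure separates the two comparisons: `s12` is larger than `s13`.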

Using the above similarity function (2), the computation of the consensus 
degrees is carried out in several steps: 

1. After the experts’ preferences are uniformed, the central values are 
calculated: 

$$cv_i^{lk}, \quad \forall i = 1, \ldots, m;\; l, k = 1, \ldots, n \wedge l \neq k \qquad (3)$$

2. For each pair of experts $e_i$, $e_j$ $(i < j)$, a similarity matrix $SM_{ij} = (sm_{ij}^{lk})$ is 
calculated, where 

$$sm_{ij}^{lk} = s(p_i^{lk}, p_j^{lk}) \qquad (4)$$

3. A consensus matrix, $CM$, is obtained by aggregating all the similarity 
matrices. This aggregation is carried out at the level of pairs of alternatives: 

$$cm^{lk} = \phi(sm_{ij}^{lk}); \quad i, j = 1, \ldots, m \wedge \forall l, k = 1, \ldots, n \wedge i < j$$

In our case, we propose the use of the arithmetic mean as the aggregation 
function $\phi$, although different aggregation operators could be used according 
to the particular properties we want to implement [8]. 

4. Computation of consensus degrees. As we said in Section 1, the consensus 
degrees are computed at the three different levels: pairs of alternatives, al- 
ternatives and experts. 

Level 1. Consensus on pairs of alternatives, $cp^{lk}$, to measure the consensus 
degree amongst all the experts on each pair of alternatives. In our case, 
this is expressed by the element $(l, k)$ of the consensus matrix $CM$, i.e., 

$$cp^{lk} = cm^{lk}, \quad \forall l, k = 1, \ldots, n \wedge l \neq k$$

The closer $cp^{lk}$ is to 1, the greater the agreement amongst all the experts on 
the pair of alternatives $(x_l, x_k)$. This measure will allow the identification 
of those pairs of alternatives with a poor level of consensus. 

Level 2. Consensus on alternatives, $ca^l$, to measure the consensus degree 
amongst all the experts on each alternative. For this, we take the average 
of each row of the consensus matrix $CM$: 

$$ca^l = \frac{\sum_{k=1}^{n} cm^{lk}}{n} \qquad (5)$$

These values can be used to propose the modification of preferences 
associated with those alternatives with a consensus degree lower than a minimal 
consensus threshold $\gamma$, i.e., $ca^l < \gamma$. 

Level 3. Consensus amongst the experts, $ce$, to measure the global 
consensus degree amongst the experts’ opinions. It is computed as the average 
of all consensus on alternative values, i.e. 

$$ce = \frac{\sum_{l=1}^{n} ca^l}{n} \qquad (6)$$
If the consensus value ce is low then there exists a great discrepancy between 
the experts’ opinions, and therefore they are far from reaching consensus. In this 
case, the moderator would urge the experts to discuss their opinions further in an 
effort to bring them closer. However, when the consensus value is high enough, 
the moderator would finish the consensus process and the selection process would 
be applied to obtain the final consensus solution to the GDM problem [2, 15]. 
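
The steps above can be sketched in Python as follows. The sketch assumes that the preferences are already uniformed, that the arithmetic mean is used as the aggregation function, and that the row averages in (5) run over all $n$ entries of a row, so the caller also supplies the diagonal entries. The data layout `prefs[i][l][k]` and the function names are assumptions made for illustration.

```python
from itertools import combinations


def central_value(p):
    return sum(h * alpha for h, alpha in enumerate(p)) / sum(p)


def similarity(p, q):
    g = len(p) - 1
    return 1 - abs(central_value(p) - central_value(q)) / g


def consensus_degrees(prefs):
    """prefs[i][l][k]: membership vector of expert i on the pair (x_l, x_k)."""
    m, n = len(prefs), len(prefs[0])
    pairs = list(combinations(range(m), 2))          # all expert pairs i < j
    # Consensus matrix: mean similarity over all pairs of experts (Level 1).
    cm = [[sum(similarity(prefs[i][l][k], prefs[j][l][k]) for i, j in pairs)
           / len(pairs) for k in range(n)] for l in range(n)]
    ca = [sum(row) / n for row in cm]                # Level 2, equation (5)
    ce = sum(ca) / n                                 # Level 3, equation (6)
    return cm, ca, ce


d = (0, 0, 0, 1, 0, 0)   # placeholder entry shared by all experts (assumption)
prefs = [
    [[d, (1, 0, 0, 0, 0, 0)], [d, d]],
    [[d, (0, 0, 0, 1, 0, 0)], [d, d]],
    [[d, (0, 0, 0, 0, 0, 1)], [d, d]],
]
cm, ca, ce = consensus_degrees(prefs)
```

In this toy input only the pair $(x_1, x_2)$ is contested (it carries the three assessments of Example 1), so `cm[0][1]` is the only entry of the consensus matrix below 1.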



3.3 Computation of Proximity Measures 



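Proximity measures evaluate the agreement between the individual experts’ 
opinions and the group opinion. Thus, to calculate them, a collective preference 
relation, $P_{e_c} = (p_c^{lk})$, has to be obtained by means of the aggregation of the set of 
(uniformed) individual preference relations $\{P_{e_i} = (p_i^{lk});\; i = 1, \ldots, m\}$: 

$$p_c^{lk} = \psi(p_1^{lk}, \ldots, p_m^{lk})$$

with $\psi$ an “aggregation operator”. As $p_i^{lk} = (\alpha_{i0}^{lk}, \ldots, \alpha_{ig}^{lk})$, then 
$p_c^{lk} = (\alpha_{c0}^{lk}, \ldots, \alpha_{cg}^{lk})$ with 

$$\alpha_{ch}^{lk} = \psi(\alpha_{1h}^{lk}, \ldots, \alpha_{mh}^{lk}),$$

which means that $p_c^{lk}$ is also a fuzzy set defined on $S_T$. 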

Clearly, the expression (2) can be used to evaluate the agreement between 
each individual expert’s preferences, $P_{e_i}$, and the collective preferences, $P_{e_c}$. 
Therefore, the measurement of proximity is carried out in two steps: 



1. A proximity matrix, $PM_i = (pm_i^{lk})$, for each expert $e_i$, is obtained where 
$pm_i^{lk} = s(p_i^{lk}, p_c^{lk})$. 

2. Computation of proximity measures. Again, we calculate proximity measures 
at three different levels. 

Level 1. Proximity on pairs of alternatives, $pp_i^{lk}$, to measure the proximity 
between the preferences, on each pair of alternatives, of each individual 
expert, $e_i$, and the group’s ones. In our case, this is expressed by the 
element $(l, k)$ of the proximity matrix $PM_i$, i.e., 

$$pp_i^{lk} = pm_i^{lk}, \quad \forall l, k = 1, \ldots, n \wedge l \neq k$$

Level 2. Proximity on alternatives, $pa_i^l$, to measure the proximity between 
the preferences, on each alternative, of each individual expert, $e_i$, and the 
group’s ones. For this, we take the average of each row of the proximity 
matrix $PM_i$: 

$$pa_i^l = \frac{\sum_{k=1}^{n} pm_i^{lk}}{n} \qquad (7)$$

Level 3. Expert’s proximity, $pe_i$, to measure the global proximity between 
the preferences of each individual expert, $e_i$, and the group’s ones. It is 
computed as the average of all proximity on alternative values, i.e. 

$$pe_i = \frac{\sum_{l=1}^{n} pa_i^l}{n} \qquad (8)$$
If the above values are close to 1 then they contribute positively to 
a high consensus, while if they are close to 0 then they contribute negatively 
to consensus. As a consequence, these proximity measures can be 
used to build a feedback mechanism, based on simple rules or recommendations 
to support the experts in changing their opinions and thus obtain the highest 
degree of consensus possible, as was done in [8]. 
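
A minimal sketch of the proximity computation follows. It assumes that the aggregation operator for the collective preference is the component-wise arithmetic mean, which is only one possible choice, and it reuses the illustrative data layout `prefs[i][l][k]` with diagonal entries supplied by the caller.

```python
def central_value(p):
    return sum(h * alpha for h, alpha in enumerate(p)) / sum(p)


def similarity(p, q):
    g = len(p) - 1
    return 1 - abs(central_value(p) - central_value(q)) / g


def proximity_measures(prefs):
    """prefs[i][l][k]: membership vector of expert i on the pair (x_l, x_k)."""
    m, n, size = len(prefs), len(prefs[0]), len(prefs[0][0][0])
    # Collective preference: component-wise mean of the experts' vectors
    # (the choice of the arithmetic mean is an assumption of this sketch).
    collective = [[[sum(prefs[i][l][k][h] for i in range(m)) / m
                    for h in range(size)] for k in range(n)] for l in range(n)]
    # Proximity matrices PM_i (Level 1).
    pm = [[[similarity(prefs[i][l][k], collective[l][k]) for k in range(n)]
           for l in range(n)] for i in range(m)]
    pa = [[sum(row) / n for row in pm[i]] for i in range(m)]   # equation (7)
    pe = [sum(pa[i]) / n for i in range(m)]                    # equation (8)
    return collective, pm, pa, pe


d = (0, 0, 0, 1, 0, 0)   # placeholder entry shared by all experts (assumption)
prefs = [
    [[d, (1, 0, 0, 0, 0, 0)], [d, d]],
    [[d, (0, 0, 0, 1, 0, 0)], [d, d]],
    [[d, (0, 0, 0, 0, 0, 1)], [d, d]],
]
collective, pm, pa, pe = proximity_measures(prefs)
closest_expert = pe.index(max(pe))
```

With the three assessments of Example 1 on the pair $(x_1, x_2)$, the second expert is the closest to the collective opinion, which is the kind of information a feedback mechanism could use.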

4 Conclusions 

The reaching of consensus in GDM problems needs measurements to assess the 
consensus between the experts. In this paper, two types of measurements were 
proposed: consensus degrees and proximity measures. The first one is used to 
assess the agreement amongst all the experts’ opinions, while the second one is 
used to find out how far the individual opinions are from the group opinion. Both 
types of measurements are computed at three different levels of representation 
of information: pair of alternatives, alternatives and experts. 

We have also shown that to make the measurement of consensus possible in 
multi-granular linguistic GDM problems, it was necessary to unify the different 
linguistic term sets into a single linguistic term set. To do this, fuzzy sets defined 
on a basic linguistic term set (BLTS) were used. 

Finally, for future research, the proximity measures will be used to design 
a consensus support system able to generate advice on the necessary changes in 
the experts’ opinions in order to reach consensus, which would make the figure 
of the moderator unnecessary in the consensus reaching process. 
Abstract. The methods of quantification were developed and 
investigated for the purpose of analyzing qualitative data. In the second method 
of quantification, the matter of interest is to discriminate the categories 
of the response variable. 

For that purpose, numerical scores of each category are introduced so 
that the categories of the response variable can be discriminated as well 
as possible by those scores. Since the total score is the sum of each 
category’s score, the model is an additive model. Thus, if the observations exhibit 
a synergism, the method fails to grasp the structure. As a consequence, 
the response variable seems not to be discriminated by the method. 

In this paper, we propose an extension of Hayashi’s second method of 
quantification by applying a fuzzy integral approach. By using the degree of 
decomposition of scores, we can include interactions between categories 
in the model. 



1 Introduction 

The methods of quantification (or quantification methods) were developed and 
investigated for the purpose of analyzing qualitative data by Hayashi [4] and his 
colleagues in the Institute of Statistical Mathematics, Japan, and have been 
widely used in many fields such as social surveys, behavioral science, medicine 
and quality control. In addition, many works related to the methods have been 
published [2, 8, 11]. 

Suppose that we obtain a set of observations as shown in Table 1. This kind 
of observation often appears in the field of social sciences such as psychology 
and market research. An individual is requested to select one answer for each 
question. 

Let $J + 1$ be the number of questions. Question 0 is regarded as the response 
variable and the other questions are explanatory variables. The answer to 
Question 0 has $g$ patterns, and the answer to Question $j$ has $N_j$ patterns for $1 \leq j \leq J$. 
Hereafter, an answer to the response variable (Question 0) is called a “group”, 
an explanatory question is called an “item” and its answer is called a 
“category”. Thus, this kind of observation is called an item-category type observation. 
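Table 1. Item and category type observations 

| Indiv. No. | Question 0 (1 2 ... g) | Question 1 (1 2 ... N_1) | ... | Question J (1 2 ... N_J) |
|------------|------------------------|--------------------------|-----|--------------------------|
| 1          | ✓                      | ✓                        | ... | ✓                        |
| 2          | ✓                      | ✓                        | ... | ✓                        |
| ...        | ...                    | ...                      | ... | ...                      |
| i          | ✓                      | ✓                        | ... | ✓                        |
| ...        | ...                    | ...                      | ... | ...                      |
| m          | ✓                      | ✓                        | ... | ✓                        |
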

In the second method of quantification, the matter of interest is to analyze the 
relationship between the response variable and the explanatory variables, and 
to discriminate the categories of the response variable by using the information 
concerning the explanatory variables. 

For that purpose, numerical scores 

$$x(jk) = (x(11), \ldots, x(1N_1), \ldots, x(J1), \ldots, x(JN_J))$$

for the categories of the $J$ factor items are introduced so that the categories of the 
response variable can be discriminated as well as possible by those scores. 

In the second method of quantification, the optimum score of an individual 
observation is calculated as a sum of scores of categories. For example, the score 
of the first individual in Table 1, denoted by $y_1$, is calculated as 

$$y_1 = x(1N_1) + \cdots + x(J2)$$

Thus, it is regarded as an additive model. When the items have ordered 
categories, some extensions of the quantification method have been proposed [8, 10]. 

It often occurs that some response variables are characterized by particular 
combinations of categories. When the response variables and explanatory variables 
are both ordinal scales, we can apply a fuzzy integral approach such as the 
Choquet integral or the Sugeno integral [3, 7, 9]. However, such a synergism 
cannot be explained by the second method of quantification. 

In this paper, we reformulate Hayashi’s second method of quantification 
in order to reflect interactions between categories, which is regarded as an 
application of fuzzy measures. Moreover, based on this formulation, we propose 
an extension of the method in order to reflect interactions between categories 
in some items. 

The paper is organized as follows. Hayashi’s second method of quantification 
is outlined in Section 2. The formulation of the method of quantification in view 
of a fuzzy measure approach and an extension of the method are considered in 
Section 3. The method of obtaining the optimum scores of each item is shown in 
Section 4. Conclusions are given in Section 5. 
2 Hayashi’s Second Method of Quantification 

Re-ordering the observations described in Table 1 by the response variable and 
renumbering, we obtain the item-category type observations shown in Table 2. 
Let $G = \{1, \ldots, g\}$ be the set of group names. Binary data $n_i(\nu, jk)$ is defined 
as follows: 

$$n_i(\nu, jk) = \begin{cases} 1, & \text{if (1) observation } i \text{ belongs to group } \nu \text{ and (2) the response to item } j \text{ is } k \\ 0, & \text{otherwise} \end{cases}$$

for $i = 1, \ldots, m_\nu$; $\nu = 1, \ldots, g$; $j = 1, \ldots, J$; $k = 1, \ldots, N_j$, 
where $m_\nu$, $J$ and $N_j$ are the number of observations belonging to group $\nu$, the 
number of items, and the number of categories of item $j$, respectively. 

Thus $n_i(\nu, jk) = 1$ indicates the pattern of answers of the $i$-th individual of 
group $\nu$: the chosen category of item $j$ is $k$; in other words, the answer to the $j$-th 
question is $k$. By this re-ordering, the observations of Table 1 become those of Table 3. 

The score is calculated as follows. Let $x(jk)$, $j = 1, \ldots, J$, $k = 1, \ldots, N_j$, 
be the weight for category $jk$. The score of the $i$-th observation of group $\nu$, denoted 
by $y_i(\nu)$, is calculated as 

$$y_i(\nu) = \sum_{j=1}^{J} \sum_{k=1}^{N_j} x(jk)\, n_i(\nu, jk) \qquad (1)$$

The aim of Hayashi’s second method of quantification is to obtain the set of 
weights that classifies the observations as well as possible. In other words, in Hayashi’s 
second method of quantification, $x(jk)$ is determined so as to maximize the ratio 

$$\frac{V_B}{V_T}$$

where $V_B$ is the variation of the scores between groups and $V_T$ is their total 
variation. 
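Table 3. Reordered observations of Table 1 (the number in brackets is the original 
indiv. no.) 

| Groups | Indiv. No. | Item 1 (1 2 ... N_1) | ... | Item J (1 2 ... N_J) |
|--------|------------|----------------------|-----|----------------------|
| 1      | 1(2)       | 0 0 ... 1            | ... | 0 1 ... 0            |
| 2      | 1(1)       | 0 1 ... 0            | ... | 1 0 ... 0            |
| ...    | ...        | ...                  | ... | ...                  |
| g      | 1(m)       | 0 0 1 ... 0          | ... | 0 0 ... 1            |
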

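We use the following notations for calculation: 

$$x = [x(11), \ldots, x(1N_1), \ldots, x(J1), \ldots, x(JN_J)]' \qquad (2)$$

$$A = \begin{bmatrix} A_{11} & \cdots & A_{1J} \\ \vdots & & \vdots \\ A_{g1} & \cdots & A_{gJ} \end{bmatrix}, \quad A_{\nu j} = \begin{bmatrix} n_1(\nu, j1) & \cdots & n_1(\nu, jN_j) \\ \vdots & & \vdots \\ n_{m_\nu}(\nu, j1) & \cdots & n_{m_\nu}(\nu, jN_j) \end{bmatrix} \qquad (3)$$

$$B = \begin{bmatrix} B_{11} & \cdots & B_{1J} \\ \vdots & & \vdots \\ B_{g1} & \cdots & B_{gJ} \end{bmatrix}, \quad B_{\nu j} = \frac{1}{m_\nu} \begin{bmatrix} n(\nu, j1) & \cdots & n(\nu, jN_j) \\ \vdots & & \vdots \\ n(\nu, j1) & \cdots & n(\nu, jN_j) \end{bmatrix} \qquad (4)$$

$$C = \begin{bmatrix} C_{11} & \cdots & C_{1J} \\ \vdots & & \vdots \\ C_{g1} & \cdots & C_{gJ} \end{bmatrix}, \quad C_{\nu j} = \frac{1}{m} \begin{bmatrix} n(j1) & \cdots & n(jN_j) \\ \vdots & & \vdots \\ n(j1) & \cdots & n(jN_j) \end{bmatrix} \qquad (5)$$

where, as for the corresponding primed quantities in Section 4, 
$n(\nu, jk) = \sum_{i=1}^{m_\nu} n_i(\nu, jk)$ and $n(jk) = \sum_{\nu=1}^{g} n(\nu, jk)$. 

Thus, 

$$V_B = \frac{1}{m} x'(B - C)'(B - C)x = x' \Sigma_B x,$$

$$V_T = \frac{1}{m} x'(A - C)'(A - C)x = x' \Sigma_T x.$$

$\Sigma_B = \frac{1}{m}(B - C)'(B - C)$ and $\Sigma_T = \frac{1}{m}(A - C)'(A - C)$ are called the variation 
matrix between groups and the total variation matrix, respectively. 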

A vector maximizing 

$$\frac{V_B}{V_T} = \frac{x' \Sigma_B x}{x' \Sigma_T x}$$

is a solution of the generalized eigenproblem 

$$\Sigma_B x = \lambda \Sigma_T x.$$
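Table 4. Artificial data 

| Groups | Indiv. | Item 1: 11 | Item 1: 12 | Item 2: 21 | Item 2: 22 | Item 2: 23 |
|--------|--------|------------|------------|------------|------------|------------|
| 1      | 1      | 1          | 0          | 1          | 0          | 0          |
|        | 2      | 1          | 0          | 1          | 0          | 0          |
|        | 3      | 1          | 0          | 1          | 0          | 0          |
| 2      | 1      | 1          | 0          | 0          | 1          | 0          |
|        | 2      | 0          | 1          | 0          | 1          | 0          |
| 3      | 1      | 0          | 1          | 0          | 1          | 0          |
|        | 2      | 0          | 1          | 0          | 0          | 1          |
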

The eigenvector corresponding to the largest eigenvalue is the optimum weight. 
Further properties of the methods of quantification are found in [5]. 
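
As an illustration, the generalized eigenproblem can be solved numerically for the artificial data of Table 4, taken as printed. The sketch below uses NumPy and makes one design choice that is not discussed in the text: because the indicator columns of each item sum to one, the total variation matrix is singular, so the problem is restricted to the range of that matrix through its eigendecomposition. All variable names are illustrative.

```python
import numpy as np

# Table 4 as printed; columns are the categories 11, 12, 21, 22, 23.
A = np.array([[1, 0, 1, 0, 0], [1, 0, 1, 0, 0], [1, 0, 1, 0, 0],
              [1, 0, 0, 1, 0], [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0], [0, 1, 0, 0, 1]], dtype=float)
groups = np.array([1, 1, 1, 2, 2, 3, 3])

m = A.shape[0]
C = np.tile(A.mean(axis=0), (m, 1))                          # overall means
B = np.vstack([A[groups == g].mean(axis=0) for g in groups]) # group means
sigma_b = (B - C).T @ (B - C) / m      # variation matrix between groups
sigma_t = (A - C).T @ (A - C) / m      # total variation matrix (singular)

# Whiten within the range of sigma_t (assumption: tolerance 1e-10 for rank).
w, U = np.linalg.eigh(sigma_t)
keep = w > 1e-10
W = U[:, keep] / np.sqrt(w[keep])
lam, V = np.linalg.eigh(W.T @ sigma_b @ W)

eta2 = lam[-1]          # largest eigenvalue = maximal ratio V_B / V_T
x = W @ V[:, -1]        # corresponding optimum weight vector
scores = A @ x          # scores y_i(nu) of the seven individuals
```

The vector `x` satisfies the generalized eigen-equation for the eigenvalue `eta2`, and `scores` can be inspected to see how well the groups are separated.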



3 Interaction Between Items 

Since a combination of responses between items sometimes characterizes a class, 
we need to take account of the interaction between items. 

For example, in Table 4, suppose that group 1 is characterized by both 
$n_i(1, 11)$ and $n_i(1, 21)$ being 1. Then, the model should include not only the scores 
of $n_i(1, 11)$ and $n_i(1, 21)$ but also that of $\{n_i(1, 11), n_i(1, 21)\}$. 

To extend the model, we reconsider the second method of quantification. The 
most complicated model for Table 2 is 

$$y_i(\nu) = x(1t_i(\nu, 1), \ldots, Jt_i(\nu, J)) \qquad (6)$$

where 

$$t_i(\nu, j) = k \iff n_i(\nu, jk) = 1,$$

that is, $t_i(\nu, j)$ indicates the response (category) to item $j$. For example, 
in Table 4, 

$$y_1(1) = x(11, 21)$$

$$y_2(3) = x(12, 23).$$

The number of optimum weights $x(1t_i(\nu, 1), \ldots, Jt_i(\nu, J))$ is $\prod_{j=1}^{J} N_j$. If we 
assume that each optimum score is decomposed as 

$$x(1t_i(\nu, 1), \ldots, Jt_i(\nu, J)) = \sum_{j=1}^{J} x(jt_i(\nu, j)), \qquad (7)$$

the model equation (6) becomes 

$$y_i(\nu) = \sum_{j=1}^{J} x(jt_i(\nu, j)) = \sum_{j=1}^{J} \sum_{k=1}^{N_j} x(jk)\, n_i(\nu, jk)$$

and it agrees with that of the second method of quantification. 

Thus, the complexity of the model depends on the decomposition of the entire score 
(7). If $x(1t_i(\nu, 1), \ldots, Jt_i(\nu, J))$ is the sum of the scores of each category, the 
model becomes the simplest additive model. In order to take account of the effect of 
interaction, we should reflect these relations in the decomposition of scores (7). 

When we consider the interaction between (11) and (21), that is, the 
relation between the first categories of the first and second items, the decomposition 
of scores becomes: 

$$y_i(\nu) = x(11, 21, 3t_i(\nu, 3), \ldots, Jt_i(\nu, J)) = x(11, 21) + \sum_{j=3}^{J} x(jt_i(\nu, j)) \qquad (8)$$

$$= x(11, 21) + \sum_{j=3}^{J} \sum_{k=1}^{N_j} x(jk)\, n_i(\nu, jk), \quad \text{if } n_i(\nu, 11) = n_i(\nu, 21) = 1 \qquad (9)$$

$$y_i(\nu) = \sum_{j=1}^{J} \sum_{k=1}^{N_j} x(jk)\, n_i(\nu, jk), \quad \text{if } \{n_i(\nu, 11) = 0\} \vee \{n_i(\nu, 21) = 0\}$$

We can consider higher order interactions between items in a similar fashion. 

4 How to Obtain Optimum Scores 

In this section, we show how to obtain the optimum scores. We will show that 
the vector of optimum scores is the eigenvector corresponding to the largest 
eigenvalue of some matrix. Though we consider the case in which the interaction 
between (11) and (21) is included in the model, this result can be applied directly to 
the general case. 

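Corresponding to the equations (2), (3), (4) and (5), we use the following 
notations: 

$$x = [x(11, 21), x(11), \ldots, x(JN_J)]'$$

$$A = \begin{bmatrix} A_{10} & A_{11} & \cdots & A_{1J} \\ \vdots & \vdots & & \vdots \\ A_{g0} & A_{g1} & \cdots & A_{gJ} \end{bmatrix}, \quad A_{\nu 0} = \begin{bmatrix} n'_1(\nu, 11, 21) \\ \vdots \\ n'_{m_\nu}(\nu, 11, 21) \end{bmatrix}, \quad A_{\nu j} = \begin{bmatrix} n'_1(\nu, j1) & \cdots & n'_1(\nu, jN_j) \\ \vdots & & \vdots \\ n'_{m_\nu}(\nu, j1) & \cdots & n'_{m_\nu}(\nu, jN_j) \end{bmatrix}$$

$$B = \begin{bmatrix} B_{10} & B_{11} & \cdots & B_{1J} \\ \vdots & \vdots & & \vdots \\ B_{g0} & B_{g1} & \cdots & B_{gJ} \end{bmatrix}, \quad B_{\nu 0} = \frac{1}{m_\nu} \begin{bmatrix} n'(\nu, 11, 21) \\ \vdots \\ n'(\nu, 11, 21) \end{bmatrix}, \quad B_{\nu j} = \frac{1}{m_\nu} \begin{bmatrix} n'(\nu, j1) & \cdots & n'(\nu, jN_j) \\ \vdots & & \vdots \\ n'(\nu, j1) & \cdots & n'(\nu, jN_j) \end{bmatrix}$$

$$C = \begin{bmatrix} C_{10} & C_{11} & \cdots & C_{1J} \\ \vdots & \vdots & & \vdots \\ C_{g0} & C_{g1} & \cdots & C_{gJ} \end{bmatrix}, \quad C_{\nu 0} = \frac{1}{m} \begin{bmatrix} n'(11, 21) \\ \vdots \\ n'(11, 21) \end{bmatrix}, \quad C_{\nu j} = \frac{1}{m} \begin{bmatrix} n'(j1) & \cdots & n'(jN_j) \\ \vdots & & \vdots \\ n'(j1) & \cdots & n'(jN_j) \end{bmatrix}$$

where 

$$n'_i(\nu, 11, 21) = \begin{cases} 1, & \{n_i(\nu, 11) = 1\} \wedge \{n_i(\nu, 21) = 1\} \\ 0, & \text{otherwise} \end{cases}$$

$$n'_i(\nu, jk) = \begin{cases} n_i(\nu, jk), & \{(j, k) \neq (1, 1)\} \wedge \{(j, k) \neq (2, 1)\} \\ 1, & \{(j, k) = (1, 1)\} \wedge \{n_i(\nu, 11) = 1\} \wedge \{n_i(\nu, 21) = 0\} \\ 1, & \{(j, k) = (2, 1)\} \wedge \{n_i(\nu, 11) = 0\} \wedge \{n_i(\nu, 21) = 1\} \\ 0, & \text{otherwise} \end{cases}$$

$$n'(\nu, 11, 21) = \sum_{i=1}^{m_\nu} n'_i(\nu, 11, 21), \quad n'(11, 21) = \sum_{\nu=1}^{g} n'(\nu, 11, 21)$$

$$n'(\nu, jk) = \sum_{i=1}^{m_\nu} n'_i(\nu, jk), \quad n'(jk) = \sum_{\nu=1}^{g} n'(\nu, jk).$$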



For the artificial data shown in Table 4, let us consider the interaction 
between (11) and (21). The responses of the individuals of the first group are the same, 
that is, $n_i(1, 11) = n_i(1, 21) = 1$, $i = 1, 2, 3$. Thus, 

$$y_1(1) = x(11, 21), \quad y_2(1) = x(11, 21), \quad y_3(1) = x(11, 21)$$

$$y_1(2) = x(11) + x(22), \quad y_2(2) = x(12) + x(22)$$

$$y_1(3) = x(12) + x(21), \quad y_2(3) = x(12) + x(23)$$

Thus, the observations are regarded as the item-category type data shown in 
Table 5. 
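
The passage from Table 4 to Table 5 can be sketched in a few lines of Python that apply the definitions of the primed indicators to each row of Table 4 as printed. The function name `add_interaction` and the tuple layout are assumptions made for illustration.

```python
def add_interaction(row):
    """Map (n11, n12, n21, n22, n23) to (n'(11,21), n'11, n'12, n'21, n'22, n'23)."""
    n11, n12, n21, n22, n23 = row
    both = int(n11 == 1 and n21 == 1)        # interaction indicator n'(11, 21)
    only_11 = int(n11 == 1 and n21 == 0)     # category 11 chosen without 21
    only_21 = int(n11 == 0 and n21 == 1)     # category 21 chosen without 11
    return (both, only_11, n12, only_21, n22, n23)


# Rows of Table 4 as printed, in the order of groups 1, 2 and 3.
table4 = [
    (1, 0, 1, 0, 0), (1, 0, 1, 0, 0), (1, 0, 1, 0, 0),
    (1, 0, 0, 1, 0), (0, 1, 0, 1, 0),
    (0, 1, 0, 1, 0), (0, 1, 0, 0, 1),
]
table5 = [add_interaction(row) for row in table4]
```

Each individual of group 1 is then described by the single interaction column, while the rows of groups 2 and 3 keep their original categories.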

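The optimum score vector is a vector which maximizes $V_B / V_T$, where $V_B$ and $V_T$ 
are computed as in Section 2 from the matrices $A$, $B$ and $C$ defined above. 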
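Table 5. Artificial data (with interaction between (11) and (21)) 

| Groups | Indiv. | (11,21) | Item 1: 11 | Item 1: 12 | Item 2: 21 | Item 2: 22 | Item 2: 23 |
|--------|--------|---------|------------|------------|------------|------------|------------|
| 1      | 1      | 1       | 0          | 0          | 0          | 0          | 0          |
|        | 2      | 1       | 0          | 0          | 0          | 0          | 0          |
|        | 3      | 1       | 0          | 0          | 0          | 0          | 0          |
| 2      | 1      | 0       | 1          | 0          | 0          | 1          | 0          |
|        | 2      | 0       | 0          | 1          | 0          | 1          | 0          |
| 3      | 1      | 0       | 0          | 1          | 0          | 1          | 0          |
|        | 2      | 0       | 0          | 1          | 0          | 0          | 1          |
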

Thus, the optimum score vector is the eigenvector corresponding to the 
largest eigenvalue of the generalized eigenproblem 

$$\Sigma_B x = \lambda \Sigma_T x.$$



5 Conclusions 

Since the number of parameters of the model (8) is greater than that of the model (1), the 
fit of the model (8) is better than that of (1). Thus, we need to compare these 
two models, and in general many models, and select the most suitable one. 
This is a problem of model selection. To choose a suitable model, we can use the 
concept of information criteria such as AIC [1] or GIC [6]. We will compare some 
information criteria as indices of model selection for the model in this paper. 
Abstract. Symbolic data analysis is a new trend in multivariate descriptive 
statistics whose main purpose consists in analyzing and processing set-valued 
random variables. Such variables are derived by summarizing large datasets 
and abstracting information in aggregated form. Some typical examples of 
symbolic datasets are those encoded by means of interval-valued variables or 
modal variables. Unlike classical data, symbolic data can be structured and can 
contain internal variation. The aim of this paper is to extend the formal 
framework of symbolic data analysis so that fuzzy-valued variables can be 
dealt with. Some related approaches based on granular computing are also 
proposed or discussed. 



1 Extending the Formal Framework of Symbolic Data Analysis 
for Allowing Fuzzy-Valued Variables to Deal with 

Symbolic data analysis was introduced by Diday and his collaborators ([2], [1]) 
and is concerned with various types of symbolic variables: multi-valued variables, 
interval-valued variables, multi-valued modal variables, interval-valued modal 
variables. Typically, the symbolic approach considers information aggregates 
abstracted by summarization from huge datasets in such a way that the resulting 
summary dataset is of a manageable size. It does not really promote a holistic 
viewpoint because such aggregates are interpreted in terms of confined sets of 
numerical values rather than in terms of irreducible concepts. Therefore, it is not primarily 
intended to capture the dissimilarity between concepts (each one viewed as a whole), 
but the variability of the internal “ingredients” of all the aggregates representing the 
outcomes of a random set-valued variable with respect to a collapsing central 
tendency (namely, a point-wise mean). In this respect, symbolic data analysis 
differs from some holistic approaches used in the realm of granular computing, where 
the granular events are regarded as cohesive structures and the granularity of 
outcomes is transferred to the mean of the granular random variable as well. 

In this section, our aim is to provide an extension of symbolic data analysis in 
order to deal with the more challenging case of fuzzy-valued variables. 
1.1 Representation and Cardinality of a p-Dimensional Fuzzy Granule 

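Let us consider a p-dimensional fuzzy granule $A = A_1 \times \cdots \times A_p$ defined on the 
product space $X = X_1 \times \cdots \times X_p$. Representation formalism and cardinality for $A$ can be 
given either in terms of membership functions, 

$$\mu_A : X \to [0, 1], \quad \mu_A(x) = \min(\mu_{A_1}(x_1), \ldots, \mu_{A_p}(x_p)), \quad \forall x \in X \qquad (1)$$

$$|A| = \int_{\xi \in Supp(A)} \mu_A(\xi)\, d\xi = \int_{\xi_1 \in Supp(A_1)} \cdots \int_{\xi_p \in Supp(A_p)} \min[\mu_{A_1}(\xi_1), \ldots, \mu_{A_p}(\xi_p)]\, d\xi_1 \ldots d\xi_p \qquad (2)$$

or in terms of $\alpha$-level intervals: 

$$A^\alpha = A_1^\alpha \times \cdots \times A_p^\alpha; \quad \alpha \in [0, 1] \qquad (3)$$

$$|A| = \int_0^1 |A^\alpha|\, d\alpha = \int_0^1 \prod_{i=1}^{p} |A_i^\alpha|\, d\alpha \qquad (4)$$

In what follows, we will restrict attention to the class of normal fuzzy convex sets 
on $\mathbb{R}^p$, whose $\alpha$-level sets are nonempty compact convex sets for all $\alpha > 0$. In 
particular, let $A_i$ be an LR-fuzzy set. We have: 

$$A_i^\alpha = [x_{A_i}^L(\alpha), x_{A_i}^R(\alpha)] = [x_{A_i}^L - l_{A_i} \cdot L^{-1}(\alpha),\; x_{A_i}^R + r_{A_i} \cdot R^{-1}(\alpha)], \quad i = 1, \ldots, p \qquad (5)$$

$$|A| = \int_0^1 \prod_{i=1}^{p} \left( x_{A_i}^R(\alpha) - x_{A_i}^L(\alpha) \right) d\alpha = \int_0^1 \prod_{i=1}^{p} \left( x_{A_i}^R - x_{A_i}^L + l_{A_i} \cdot L^{-1}(\alpha) + r_{A_i} \cdot R^{-1}(\alpha) \right) d\alpha \qquad (6)$$

The latter integral can be evaluated numerically by means of a quadrature formula 
(e.g. the adaptive Simpson quadrature). 

1.2 Univariate Statistics: Mean and Variance of a Fuzzy Event 

Any outcome $A$ of a fuzzy-valued variable $X$ can be perceived as a fuzzy event. 
Essentially, it represents only one observation, but actually incorporates a 
conglomerate of possibilities, due to the inherent imprecision of linguistic 
specification. 

If a uniform probability distribution over the support of a given fuzzy event $A$ is 
assumed, the mean value and the variance of $A$ are calculated as: 

$$\bar{x}_A = \frac{\int_{Supp(A) \subset \mathbb{R}} \xi \cdot \mu_A(\xi)\, d\xi}{\int_{Supp(A) \subset \mathbb{R}} \mu_A(\xi)\, d\xi}, \quad \sigma_A^2 = \frac{\int_{Supp(A) \subset \mathbb{R}} (\xi - \bar{x}_A)^2 \cdot \mu_A(\xi)\, d\xi}{\int_{Supp(A) \subset \mathbb{R}} \mu_A(\xi)\, d\xi} \qquad (7)$$

where $Supp(A)$ denotes the support of $A$. 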
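
The quantities above can be approximated numerically. The following Python sketch makes several assumptions that are not taken from the text: the fuzzy sets are trapezoidal with linear shape functions, so that $L^{-1}(\alpha) = R^{-1}(\alpha) = 1 - \alpha$; a composite (not adaptive) Simpson rule is used as the quadrature formula; and the mean and variance are computed as membership-weighted averages over the support. All function names are illustrative.

```python
def simpson(f, lo, hi, n=200):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (hi - lo) / n
    odd = sum(f(lo + (2 * k - 1) * h) for k in range(1, n // 2 + 1))
    even = sum(f(lo + 2 * k * h) for k in range(1, n // 2))
    return (f(lo) + f(hi) + 4 * odd + 2 * even) * h / 3


def granule_cardinality(sets):
    """Integrate the product of alpha-level widths over alpha in [0, 1].

    Each set is (a, b, l, r): core [a, b], left spread l, right spread r.
    """
    def integrand(alpha):
        prod = 1.0
        for a, b, l, r in sets:
            prod *= (b - a) + (l + r) * (1 - alpha)   # width of the alpha-cut
        return prod
    return simpson(integrand, 0.0, 1.0)


def trapezoid_mu(a, b, l, r):
    """Membership function of the trapezoidal fuzzy set (a, b, l, r)."""
    def mu(x):
        if a <= x <= b:
            return 1.0
        if l > 0 and a - l < x < a:
            return (x - (a - l)) / l
        if r > 0 and b < x < b + r:
            return ((b + r) - x) / r
        return 0.0
    return mu


def mean_and_variance(a, b, l, r):
    """Membership-weighted mean and variance of a fuzzy event."""
    mu = trapezoid_mu(a, b, l, r)
    lo, hi = a - l, b + r
    card = simpson(mu, lo, hi)
    mean = simpson(lambda x: x * mu(x), lo, hi) / card
    var = simpson(lambda x: (x - mean) ** 2 * mu(x), lo, hi) / card
    return mean, var


triangle = (2.0, 2.0, 1.0, 1.0)    # core {2}, spreads 1 and 1
card_1d = granule_cardinality([triangle])
card_2d = granule_cardinality([triangle, triangle])
mean, var = mean_and_variance(*triangle)
```

For the symmetric triangular set used here, the mean coincides with the peak of the membership function, and the one-dimensional cardinality equals the area under the triangle.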



1.3 Univariate Statistics: the Empirical Density Function 
of a Fuzzy-Valued Variable 

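Let X be a fuzzy-valued variable and assume the outcome of X for an object
$u \in E = \{1, \ldots, n\}$ is an LR-fuzzy set $A_u$. We denote by
$A_u^L = \{(x, \mu_{A_u^L}(x)) \mid x \in [x_{A_u}^L - l_{A_u}, x_{A_u}^L]\}$ the “left” part of $A_u$, by
$A_u^C = \{x \mid x \in [x_{A_u}^L, x_{A_u}^R]\}$ the “central” part of $A_u$, and by
$A_u^R = \{(x, \mu_{A_u^R}(x)) \mid x \in [x_{A_u}^R, x_{A_u}^R + r_{A_u}]\}$ the “right” part of $A_u$.
The individual description vectors x, as elements of a virtual description
space $vir(d_u)$, are assumed to be uniformly distributed over the support of the LR-fuzzy
set $A_u$. In particular, for a trapezoidal fuzzy set, we have:

$$P\{x \le \xi \mid x \in vir(d_u)\} = \begin{cases} 0, & \xi < x_{A_u}^L - l_{A_u} \\ \dfrac{(\xi - x_{A_u}^L + l_{A_u})^2}{2\, l_{A_u} |A_u|}, & x_{A_u}^L - l_{A_u} \le \xi < x_{A_u}^L \\ \dfrac{l_{A_u}/2 + \xi - x_{A_u}^L}{|A_u|}, & x_{A_u}^L \le \xi \le x_{A_u}^R \\ 1 - \dfrac{(x_{A_u}^R + r_{A_u} - \xi)^2}{2\, r_{A_u} |A_u|}, & x_{A_u}^R < \xi \le x_{A_u}^R + r_{A_u} \\ 1, & \xi > x_{A_u}^R + r_{A_u} \end{cases} \qquad (8)$$

where $|A_u|$ denotes the cardinality of $A_u$, i.e.,
$|A_u| = \int_{\xi \in Supp(A_u)} \mu_{A_u}(\xi)\, d\xi = |A_u^L| + |A_u^C| + |A_u^R|$. The empirical density function of X is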



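$$f(\xi) = \frac{1}{n} \sum_{u \in E} \frac{\mu_{A_u}(\xi)}{|A_u|} \qquad (9)$$

From $\int_{\xi \in Supp(A_u)} \frac{\mu_{A_u}(\xi)}{|A_u|}\, d\xi = 1$ it follows that

$$\int_{\xi \in \mathbb{R}} f(\xi)\, d\xi = \frac{1}{n} \sum_{u \in E} \int_{\xi \in Supp(A_u)} \frac{\mu_{A_u}(\xi)}{|A_u|}\, d\xi = \frac{1}{n} \sum_{u=1}^{n} 1 = 1 \qquad (10)$$

where $n = card(E)$.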



1.4 Univariate Statistics: the Point-Wise Sample Mean
of a Fuzzy-Valued Variable



Let X be a fuzzy-valued variable whose outcomes are LR-fuzzy sets. The point-wise 
sample mean of X is defined as 




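$$\bar{X} = \int_{\xi \in \mathbb{R}} \xi \cdot f(\xi)\, d\xi = \frac{1}{n} \sum_{u \in E} \frac{1}{|A_u|} \int_{\xi \in Supp(A_u)} \xi \cdot \mu_{A_u}(\xi)\, d\xi \qquad (11)$$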
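
When the outcomes are trapezoidal, the integral in the point-wise sample mean can be evaluated piecewise: the membership function splits into a left triangle, a rectangle and a right triangle, whose areas and centroids are elementary. The sketch below follows that decomposition. It is an illustrative derivation under the assumption of linear left and right shapes and strictly positive cardinality, not a transcription of the paper's closed form; the function names are hypothetical.

```python
def trapezoid_moments(x_left, x_right, l, r):
    """Return (cardinality, first moment) of a trapezoidal fuzzy set.

    Membership rises linearly on [x_left - l, x_left], equals 1 on
    [x_left, x_right] and falls linearly on [x_right, x_right + r].
    """
    area_left, area_core, area_right = l / 2.0, x_right - x_left, r / 2.0
    # Centroids of the left triangle, the rectangle and the right triangle.
    c_left = x_left - l / 3.0
    c_core = (x_left + x_right) / 2.0
    c_right = x_right + r / 3.0
    cardinality = area_left + area_core + area_right
    first_moment = area_left * c_left + area_core * c_core + area_right * c_right
    return cardinality, first_moment


def pointwise_sample_mean(outcomes):
    """outcomes: one (x_left, x_right, l, r) trapezoid per object u.

    Assumes every trapezoid has positive cardinality (no crisp points).
    """
    total = 0.0
    for outcome in outcomes:
        cardinality, first_moment = trapezoid_moments(*outcome)
        total += first_moment / cardinality
    return total / len(outcomes)
```

For a symmetric trapezoid the per-object term reduces to the midpoint of the core, as expected.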



For a fuzzy-valued variable whose outcomes are trapezoidal fuzzy sets, the point- 
wise sample mean is 




1.5 Univariate Statistics: the Sample Variance of a Fuzzy-Valued Variable

The sample variance of X is 



$$S_X^2 = \int_{\xi \in \mathbb{R}} (\xi - \bar{X})^2 \cdot f(\xi)\, d\xi = \frac{1}{n} \sum_{u \in E} \frac{1}{|A_u|} \int_{\xi \in Supp(A_u)} \xi^2 \cdot \mu_{A_u}(\xi)\, d\xi - \bar{X}^2 \qquad (13)$$
1.6 Bivariate Statistics for Fuzzy-Valued Variables: the Sample Covariance
and the Sample Correlation

Let X be a random fuzzy-valued vector defined on the product space $X_1 \times X_2$.
Then, the empirical joint density function of X is






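$$f(\xi_1, \xi_2) = \frac{1}{n} \sum_{u \in E} \frac{\mu_{A_{u1} \times A_{u2}}(\xi_1, \xi_2)}{|A_{u1} \times A_{u2}|} \qquad (14)$$

with

$$\int_{\xi_1 \in \mathbb{R}} \int_{\xi_2 \in \mathbb{R}} f(\xi_1, \xi_2)\, d\xi_1 d\xi_2 = \frac{1}{n} \sum_{u \in E} \int_{\xi_1 \in Supp(A_{u1})} \int_{\xi_2 \in Supp(A_{u2})} \frac{\mu_{A_{u1} \times A_{u2}}(\xi_1, \xi_2)}{|A_{u1} \times A_{u2}|}\, d\xi_1 d\xi_2 = 1 \qquad (15)$$

The sample covariance now takes the form

$$Cov(X_1, X_2) = \int_{\xi_1 \in \mathbb{R}} \int_{\xi_2 \in \mathbb{R}} (\xi_1 - \bar{X}_1)(\xi_2 - \bar{X}_2) \cdot f(\xi_1, \xi_2)\, d\xi_1 d\xi_2 = \frac{1}{n} \sum_{u \in E} \frac{1}{|A_{u1} \times A_{u2}|} \int_{\xi_1 \in Supp(A_{u1})} \int_{\xi_2 \in Supp(A_{u2})} \xi_1 \xi_2 \cdot \mu_{A_{u1} \times A_{u2}}(\xi_1, \xi_2)\, d\xi_1 d\xi_2 - \bar{X}_1 \bar{X}_2 \qquad (16)$$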



Furthermore, the sample correlation can be derived in the usual way 

$$Cor(X_1, X_2) = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1) \cdot Var(X_2)}} \qquad (17)$$



1.7 A Generalization of Principal Components Analysis
for Dealing with Fuzzy Granules

The first extension of principal components analysis allowing the processing of 
fuzzy-termed data was introduced by Georgescu in 1996 ([4]), using a holistic 
approach with roots in the holistic theory of perception (the so-called gestalt theory). 
The guidelines of this approach were based upon the assumption that a fuzzy-valued
variable (i.e., a random variable whose outcomes are fuzzy granules) is suitable to be
endowed with a granular (fuzzy) mean, instead of a point-wise mean. Consequently,
the variance and the covariance for such variables were deduced as deviations of
fuzzy outcomes with respect to their fuzzy mean. One year later, Cazes et al. ([1])
proposed a method of conducting PCA on symbolic interval data, where variance and
covariance of interval-valued variables are typically deduced with respect to a
point-wise mean. Such an approach is also well justified when the granules are
perceived as aggregates or conglomerates of points generated by an agglomerative
mechanism rather than as cohesive structures.
In what follows, we propose a generalization of this method in order to deal with
p-dimensional fuzzy granules, instead of p-dimensional hyperboxes.

Let us consider a data collection describing n individuals by means of p fuzzy-valued
variables. In other words, each individual $u \in E = \{1, \ldots, n\}$ is described by a
p-dimensional fuzzy granule. Such a granule is delimited by two p-dimensional
hyperboxes, each one with $2^p$ vertices: the core and an envelope around it. Thus,
there are $2 \cdot 2^p = 2^{p+1}$ vertices describing an individual, which can be represented by
a $2^{p+1} \times p$ matrix. Finally, an $(n \cdot 2^{p+1} \times p)$-dimensional matrix is constructed by
vertical concatenation in order to obtain a description of the n individuals. The
principal component analysis can now be applied in a classical way to this matrix for
reducing the initial p-dimensional space into a subspace with a smaller dimensionality
$(s < p)$. The dimensionality reduction is subject to the minimum information loss
condition. It allows each individual to be represented in the subspace of principal
components by an s-dimensional fuzzy granule, obtained by projecting the
corresponding p-dimensional fuzzy granule onto this subspace. Figure 1 illustrates the
projection process guided by the minimum loss of inertia. Additionally, we illustrate
how the bi-dimensional fuzzy granule representing the granular principal component
corresponding to the individual u is built by projection from the 3-dimensional fuzzy
granule corresponding to the same individual. As one can see, we first obtain the
$2^p = 2^3 = 8$ numerical principal components associated with the envelope, by
projecting the vertices of the corresponding 3-dimensional hyper-rectangle, and the 8
numerical principal components associated with the vertices of the core. Afterward,
the inferior and superior borders of the granular principal component are built by
taking the minimum and the maximum from the two sequences of numerical principal
components, respectively.
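
The vertex construction and the min/max projection step described above can be sketched as follows. This is a minimal illustration only: the loading vectors (`axes`) are assumed to have been obtained beforehand by any classical PCA routine applied to the stacked vertex matrix, and that step is not reproduced here. All function names and the `(lower, upper)` box representation are hypothetical.

```python
from itertools import product


def hyperbox_vertices(lower, upper):
    """All 2^p vertices of the hyperbox [lower_1, upper_1] x ... x [lower_p, upper_p]."""
    return [list(v) for v in product(*zip(lower, upper))]


def granule_vertex_matrix(core, envelope):
    """Stack envelope and core vertices into a 2^(p+1) x p matrix (list of rows).

    core and envelope are (lower, upper) pairs of coordinate lists.
    """
    return hyperbox_vertices(*envelope) + hyperbox_vertices(*core)


def project(rows, axes):
    """Project each row onto each axis (one loading vector per principal component)."""
    return [[sum(x * a for x, a in zip(row, axis)) for axis in axes] for row in rows]


def granular_component(core, envelope, axes):
    """Per component, return (envelope_min, core_min, core_max, envelope_max)."""
    env = project(hyperbox_vertices(*envelope), axes)
    cor = project(hyperbox_vertices(*core), axes)
    result = []
    for k in range(len(axes)):
        env_k = [row[k] for row in env]
        cor_k = [row[k] for row in cor]
        result.append((min(env_k), min(cor_k), max(cor_k), max(env_k)))
    return result
```

Stacking the output of `granule_vertex_matrix` for all n individuals gives the $(n \cdot 2^{p+1} \times p)$ matrix to which the classical analysis is applied.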




Fig. 1. Six individuals described by 3-dimensional fuzzy granules and their projection onto the
plane of principal components



2 Some Related Approaches Based on Granular Computing 

The aim of this section is to articulate a formal framework allowing some granular 
computing methods - originally designed for dealing with hyperboxes - to be carried 
out in terms of fuzzy granules. First, we focus on generalizing the granular clustering
method introduced by Pedrycz and Bargiela (2002), where only crisp granules
(hyperboxes) are enabled during the cluster growing process. Our refinements
address formal extensions of some metrics, measures and criteria resulting in a
unitary and polymorphic treatment of either crisp or fuzzy granules. Second, we are
concerned with developing a non-conventional multidimensional scaling method whose
aim is to reconstruct the unknown configuration of p-dimensional fuzzy granules
describing n individuals. The only information about them is imprecisely given as a
set of pair-wise dissimilarities expressed in terms of trapezoidal fuzzy sets.

2.1 Granular Clustering Carried Out in Terms of Fuzzy Granules 

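Suitable metrics for p-dimensional fuzzy granules. Let $A_i = (x_{A_i}^L, x_{A_i}^R, l_{A_i}, r_{A_i})_{LR}$
be an LR-fuzzy set, $A_i^L = \{(x, \mu_{A_i^L}(x)) \mid x \in [x_{A_i}^L - l_{A_i}, x_{A_i}^L]\}$ be the “left” part of $A_i$,
$A_i^C = \{x \mid x \in [x_{A_i}^L, x_{A_i}^R]\}$ be the “central” part of $A_i$, and
$A_i^R = \{(x, \mu_{A_i^R}(x)) \mid x \in [x_{A_i}^R, x_{A_i}^R + r_{A_i}]\}$ be the “right” part of $A_i$. Denote by $s_{A_i}^L$,
$s_{A_i}^C$ and $s_{A_i}^R$ the functions mapping the interval [0, 1] into three intervals that cover
the support of $A_i$:

$$s_{A_i}^L : [0,1] \to [x_{A_i}^L - l_{A_i}, x_{A_i}^L]; \quad s_{A_i}^L(\alpha) = x_{A_i}^L - l_{A_i} \cdot L^{-1}(\alpha)$$

$$s_{A_i}^C : [0,1] \to [x_{A_i}^L, x_{A_i}^R]; \quad s_{A_i}^C(t) = (1-t) \cdot x_{A_i}^L + t \cdot x_{A_i}^R \qquad (18)$$

$$s_{A_i}^R : [0,1] \to [x_{A_i}^R, x_{A_i}^R + r_{A_i}]; \quad s_{A_i}^R(\alpha) = x_{A_i}^R + r_{A_i} \cdot R^{-1}(\alpha)$$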




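Assuming the functions above to be Lebesgue square-integrable allows us to use
the norm induced by the inner product that equips the Hilbert space $L_2([0,1])$:

$$\|s_A^L\|^2 = \int_{[0,1]} \left(s_A^L(\alpha)\right)^2 \lambda(d\alpha); \qquad \langle s_A^L, s_B^L \rangle = \int_{[0,1]} s_A^L(\alpha) \cdot s_B^L(\alpha)\, \lambda(d\alpha)$$

$$\|s_A^C\|^2 = \int_0^1 \left(s_A^C(t)\right)^2 dt; \qquad \langle s_A^C, s_B^C \rangle = \int_0^1 s_A^C(t) \cdot s_B^C(t)\, dt \qquad (19)$$

$$\|s_A^R\|^2 = \int_{[0,1]} \left(s_A^R(\alpha)\right)^2 \lambda(d\alpha); \qquad \langle s_A^R, s_B^R \rangle = \int_{[0,1]} s_A^R(\alpha) \cdot s_B^R(\alpha)\, \lambda(d\alpha)$$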
where λ is a normalized measure on [0, 1]. Different choices for λ allow different
definitions of distances between fuzzy sets. For example, one can use a constant
weighting function $w(\alpha) = 1, \forall \alpha \in [0,1]$ (i.e., $\lambda([0,1]) = \int_0^1 1\, d\alpha = 1$), or an
increasing weighting function $w(\alpha) = 2\alpha$ for $\alpha \in [0,1]$ (i.e., $\lambda([0,1]) = \int_0^1 2\alpha\, d\alpha = 1$).

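In general, let us consider two p-dimensional fuzzy granules $A = A_1 \times \cdots \times A_p$ and
$B = B_1 \times \cdots \times B_p$, each one defined on the product space $X = X_1 \times \cdots \times X_p$. The
quadratic distance between the components $A_i$ and $B_i$ of A and B, along the
dimension i, is given by

$$\delta_2^2(A_i, B_i) = \frac{1}{3} \left[ \int_0^1 \left(s_{A_i}^L(\alpha) - s_{B_i}^L(\alpha)\right)^2 w(\alpha)\, d\alpha + \int_0^1 \left(s_{A_i}^C(t) - s_{B_i}^C(t)\right)^2 dt + \int_0^1 \left(s_{A_i}^R(\alpha) - s_{B_i}^R(\alpha)\right)^2 w(\alpha)\, d\alpha \right]$$

$$= \begin{pmatrix} x_{A_i}^L - x_{B_i}^L \\ x_{A_i}^R - x_{B_i}^R \\ l_{A_i} - l_{B_i} \\ r_{A_i} - r_{B_i} \end{pmatrix}^{T} \begin{pmatrix} 4/9 & 1/18 & -\frac{1}{3}\|l\|_1 & 0 \\ 1/18 & 4/9 & 0 & \frac{1}{3}\|r\|_1 \\ -\frac{1}{3}\|l\|_1 & 0 & \frac{1}{3}\|l\|_2^2 & 0 \\ 0 & \frac{1}{3}\|r\|_1 & 0 & \frac{1}{3}\|r\|_2^2 \end{pmatrix} \begin{pmatrix} x_{A_i}^L - x_{B_i}^L \\ x_{A_i}^R - x_{B_i}^R \\ l_{A_i} - l_{B_i} \\ r_{A_i} - r_{B_i} \end{pmatrix}$$

where

$$\|l\|_1 = \int_{[0,1]} L^{-1}(\alpha)\, \lambda(d\alpha); \qquad \|l\|_2^2 = \int_{[0,1]} \left(L^{-1}(\alpha)\right)^2 \lambda(d\alpha)$$

$$\|r\|_1 = \int_{[0,1]} R^{-1}(\alpha)\, \lambda(d\alpha); \qquad \|r\|_2^2 = \int_{[0,1]} \left(R^{-1}(\alpha)\right)^2 \lambda(d\alpha)$$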
The quadratic distance between A and B can now be defined as follows

$$\delta_2^2(A, B) = \sum_{i=1}^{p} \delta_2^2(A_i, B_i)$$
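
A small numerical sketch of this distance for trapezoidal components is given below. Several points are assumptions made for illustration: the shapes are taken as linear, so $L^{-1}(\alpha) = R^{-1}(\alpha) = 1 - \alpha$; the three integrals are averaged with a factor 1/3, which is inferred from the surviving matrix entries 4/9 and 1/18; and a simple midpoint rule replaces exact integration. The function names are hypothetical.

```python
def integrate(f, steps=2000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))


def component_distance_sq(a, b, w=lambda alpha: 1.0, inv=lambda alpha: 1.0 - alpha):
    """Squared distance between two fuzzy sets a, b = (x_left, x_right, l, r).

    w is the weighting function; inv is the common inverse shape function
    (linear by default, which corresponds to trapezoids).
    """
    dxl, dxr, dl, dr = (a[i] - b[i] for i in range(4))
    left = integrate(lambda al: (dxl - dl * inv(al)) ** 2 * w(al))
    core = integrate(lambda t: ((1.0 - t) * dxl + t * dxr) ** 2)
    right = integrate(lambda al: (dxr + dr * inv(al)) ** 2 * w(al))
    return (left + core + right) / 3.0


def closed_form_uniform(a, b):
    """Same quantity for w = 1 and trapezoids, where ||l||_1 = 1/2, ||l||_2^2 = 1/3."""
    dxl, dxr, dl, dr = (a[i] - b[i] for i in range(4))
    left = dxl ** 2 - dxl * dl + dl ** 2 / 3.0
    core = (dxl ** 2 + dxl * dxr + dxr ** 2) / 3.0
    right = dxr ** 2 + dxr * dr + dr ** 2 / 3.0
    return (left + core + right) / 3.0


def granule_distance_sq(A, B, w=lambda alpha: 1.0):
    """Sum of the component distances over the p dimensions."""
    return sum(component_distance_sq(a, b, w) for a, b in zip(A, B))
```

Passing `w=lambda alpha: 2.0 * alpha` gives the increasing weighting, which emphasizes agreement near the cores of the two fuzzy sets.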
2.1.1 The Aggregation of Two p-Dimensional Fuzzy Granules with Respect
to a Compatibility Measure.

The main step of the iterative process is finding the two closest information granules
in order to aggregate them into a more comprehensive one. Let C = agg(A, B) be the
resulting granule. In terms of p-dimensional LR-fuzzy granules, the aggregation
process can be carried out as follows

$$C = C_1 \times \cdots \times C_p \quad \text{with} \quad C_i^{\alpha} = \left(x_{C_i}^L - l_{C_i} \cdot L^{-1}(\alpha),\; x_{C_i}^R + r_{C_i} \cdot R^{-1}(\alpha)\right) \qquad (23)$$

The core of $C_i$ is obtained for α = 1

$$C_i^1 = [x_{C_i}^L, x_{C_i}^R] = \left(\min(x_{A_i}^L, x_{B_i}^L),\; \max(x_{A_i}^R, x_{B_i}^R)\right) \qquad (24)$$

The support of $C_i$ is obtained for α = 0

$$C_i^0 = \left(x_{C_i}^L - l_{C_i},\; x_{C_i}^R + r_{C_i}\right) = \left(\min(x_{A_i}^L - l_{A_i}, x_{B_i}^L - l_{B_i}),\; \max(x_{A_i}^R + r_{A_i}, x_{B_i}^R + r_{B_i})\right) \qquad (25)$$

The compatibility measure guiding the search for the two closest fuzzy granules
can be defined with respect to the cardinality $|C|$ of C

$$compat(A, B) = 1 - \delta_2(A, B) \cdot e^{-\alpha |C|} \qquad (26)$$

This criterion applies to normalized granules, i.e., granules lying in the unit
hypercube. Maximizing the compatibility measure means that the pair of candidate
fuzzy granules to be clustered should not only be close enough (i.e., the distance
between them should be small), but the resulting granule should be compact (i.e., its
expansion along every direction must be well-balanced). The latter requirement
favours such pairs of granules that are aggregated into a new granule with large
cardinality (volume).
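
The aggregation step can be sketched for trapezoidal granules as follows. The code takes the core of the result as the smallest interval covering both cores and its support as the smallest interval covering both supports, then evaluates the cardinality by a midpoint rule over α-levels. The exponential form of the compatibility and its `decay` parameter (playing the role of the coefficient in the exponent) are assumptions of this illustration, as are all function names.

```python
import math


def aggregate(A, B):
    """Smallest trapezoidal granule covering A and B, dimension by dimension.

    A and B are lists of (x_left, x_right, l, r) tuples.
    """
    C = []
    for (axl, axr, al, ar), (bxl, bxr, bl, br) in zip(A, B):
        core_lo, core_hi = min(axl, bxl), max(axr, bxr)
        supp_lo = min(axl - al, bxl - bl)
        supp_hi = max(axr + ar, bxr + br)
        C.append((core_lo, core_hi, core_lo - supp_lo, supp_hi - core_hi))
    return C


def cardinality(G, steps=1000):
    """Integral over alpha in [0, 1] of the product of alpha-level widths."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) * h
        width = 1.0
        for xl, xr, l, r in G:
            width *= (xr - xl) + (l + r) * (1.0 - alpha)
        total += width
    return total * h


def compatibility(distance, C, decay=1.0):
    """1 - distance * exp(-decay * |C|); decay is a hypothetical tuning parameter."""
    return 1.0 - distance * math.exp(-decay * cardinality(C))
```

In a clustering loop, one would evaluate `compatibility` for every candidate pair and merge the pair with the highest value.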



2.1.2 Expressing Inclusion of two p-Dimensional Fuzzy Granules. 

For expressing the extent to which a p-dimensional fuzzy granule A is included in B,
an inclusion index is used, which is defined as a ratio of two cardinality measures

$$incl(A, B) = \frac{|A \cap B|}{|A|} \qquad (28)$$

The average of the maximum inclusion rates of each cluster in every other cluster
is a measure of cluster overlapping, which can be used to encourage merging of
clusters that have significant overlap:

$$overlap(c) = \frac{1}{c} \sum_{i=1}^{c} \max_{j = 1, \ldots, c;\; j \ne i} incl(A_i, A_j) \qquad (29)$$

where c is the current number of clusters and $A_i$ and $A_j$ are the i-th and j-th clusters,
respectively.



2.1.3 Application. 

Let us start with an initial configuration of fuzzy granules, represented in the pattern
space (as shown in Fig. 2). Forming the clusters is clearly a process of growing
information granules. At each stage, two granules are aggregated into a new one, 
embracing them. In this way, one condenses the initial set of granules into a reduced 
number of representative clusters, while enlarging their granularity. The structure of 
data is captured by the location and granularity of the final configuration of clusters. 
Remarkably, each cluster is a well-delimited region in the pattern space. The family 
of such clusters may be used as a concise descriptor of fuzzy-termed data structure. 




Fig. 2. The sequence of cluster growing over the granular clustering process 
2.2 A Non-Conventional Multidimensional Scaling Method Carried Out
in Terms of Fuzzy Granules

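Suppose a set of pair-wise dissimilarities between n individuals is given in terms of
trapezoidal fuzzy sets

$$\delta_{ij}(\alpha) = \left(\delta_{ij}^L(\alpha), \delta_{ij}^R(\alpha)\right) = \left(\delta_{ij}^L - (1-\alpha) \cdot \lambda_{ij},\; \delta_{ij}^R + (1-\alpha) \cdot \rho_{ij}\right) \qquad (30)$$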

For reconstructing the unknown configuration of p-dimensional fuzzy granules
describing the n individuals, we will attach to each individual i a p-dimensional
fuzzy granule $A_i = A_{i1} \times \cdots \times A_{ip}$, where $A_{ik}$ is given in trapezoidal form,

$$A_{ik}(\alpha) = \left(A_{ik}^L(\alpha), A_{ik}^R(\alpha)\right) = \left(x_{ik}^L - (1-\alpha) \cdot p_{ik},\; x_{ik}^R + (1-\alpha) \cdot q_{ik}\right) \qquad (31)$$



2.2.1 A New Concept of Distance Between p-Dimensional Fuzzy Granules: 
the Fuzzy Distance. 

We propose another concept of distance: the fuzzy distance between p-dimensional
fuzzy granules. For two fuzzy granules whose components along each dimension are
trapezoidal fuzzy sets, the fuzzy distance is still a trapezoidal fuzzy set (provided that
the granules have an empty intersection) and can be defined as

$$d_{ij}(\alpha) = \left(d_{ij}^L(\alpha), d_{ij}^R(\alpha)\right) = \left(d_{ij}^L - (1-\alpha) \cdot l_{ij},\; d_{ij}^R + (1-\alpha) \cdot r_{ij}\right) \qquad (32)$$

This is because, in such a case, $d_{ij}^L(\alpha)$ is an increasing function that interpolates
linearly between $d_{ij}^L(0)$ and $d_{ij}^L(1)$ whereas $d_{ij}^R(\alpha)$ is a decreasing function that
interpolates linearly between $d_{ij}^R(1)$ and $d_{ij}^R(0)$. Fig. 3 illustrates two bi-dimensional
fuzzy granules and the trapezoidal shape of the fuzzy distance between them.




Fig. 3. The fuzzy distance between multidimensional fuzzy granules 
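The parameters $d_{ij}^L$, $d_{ij}^R$, $l_{ij}$, $r_{ij}$ can be expressed with respect to $x_{uk}^L$,
$x_{uk}^R$, $p_{uk}$, $q_{uk}$, for u = i, j, and the latter can then be determined by minimizing the
following stress function

$$S = \sum_{i<j} \left( \int_0^1 \left[d_{ij}^L - (1-\alpha) \cdot l_{ij} - \delta_{ij}^L(\alpha)\right]^2 d\alpha + \int_0^1 \left[d_{ij}^R + (1-\alpha) \cdot r_{ij} - \delta_{ij}^R(\alpha)\right]^2 d\alpha \right) \qquad (33)$$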



The unknown configuration of p-dimensional fuzzy granules describing the n 
individuals can now be reconstructed (see [3] for details). 
Abstract. In decision-making, information is usually provided by
means of fuzzy preference relations. However, there may be cases in 
which experts do not have an in-depth knowledge of the problem to be 
solved, and thus their fuzzy preference relations may be incomplete, 
i.e. some values may not be given or may be missing. In this paper we 
present a procedure to find out the missing values of an incomplete 
fuzzy preference relation using the values known. We also define an 
expert consistency measure, based on additive consistency property. 
We show that our procedure to find out the missing values maintains 
the consistency of the original, incomplete fuzzy preference relation 
provided by the expert. Finally, to illustrate all this, an example of the 
procedure is presented. 

Keywords: Decision-making, fuzzy preference relations, missing values, 
consistency, additive consistency, incomplete information 



1 Introduction 

Decision-making procedures are increasingly being used in various fields
for evaluation, selection and prioritisation purposes, that is, making preference
decisions about a set of different choices. Furthermore, it is also obvious that
the comparison of different alternative actions according to their desirability in
decision problems, in many cases, cannot be done using a single criterion or
one person. Indeed, in the majority of decision-making problems, procedures
have been established to combine opinions about alternatives related to different
points of view. These procedures are based on pair comparisons, in the sense
that processes are linked to some degree of credibility of preference of one
alternative over another. Many different representation formats can be used to
express preferences. The fuzzy preference relation is one of these formats, and it is
usually used by an expert to provide his/her preference degrees when comparing
pairs of alternatives [1, 3, 5, 7].

Since each expert is characterised by their own personal background and 
experience of the problem to be solved, experts’ opinions may differ substantially 
(there are plenty of educational and cultural factors that influence an expert’s 
preferences). This diversity of experts could lead to situations where some of 
them would not be able to efficiently express any kind of preference degree 
between two or more of the available options. Indeed, this may be due to an 
expert not possessing a precise or sufficient level of knowledge of part of the 
problem, or because that expert is unable to discriminate the degree to which 
some options are better than others. In these situations such an expert is forced 
to provide an incomplete fuzzy preference relation [9]. 

Usual procedures for multi-person decision-making problems correct this lack
of knowledge of a particular expert using the information provided by the rest
of the experts together with aggregation procedures [6]. These approaches have
several disadvantages. First, they require multiple experts in order to learn the
missing value of a particular one. Second, these procedures normally do not take
into account the differences between
experts’ preferences, which could lead to the estimation of a missing value that
would not naturally be compatible with the rest of the preference values given
by that expert. Finally, some of these missing information-retrieval procedures
are interactive, that is, they need experts to collaborate in “real time”, an option
which is not always possible.

Our proposal is quite different from the above procedures. We put forward
a procedure which attempts to find out the missing information in an expert’s
incomplete fuzzy preference relation, using only the preference values provided
by that particular expert. By doing this, we ensure that the reconstruction of
the incomplete fuzzy preference relation is compatible with the rest of the
information provided by that expert. In fact, the procedure we propose in this paper
is guided by the expert’s consistency, which is measured taking into account only
the provided preference values. Thus, an important objective in the design of our
procedure is to maintain experts’ consistency levels. In particular, in this paper
we use the additive consistency property [4] to define a consistency measure of
the expert’s information.

In order to do this, the paper is set out as follows. Section 2 presents some 
preliminaries on the additive consistency property. In Section 3, a new consis- 
tency measure and the learning procedure are described. We also include a brief 
discussion of the possible situations in which the procedure will be successful 
in discovering all the missing values and we provide the sufficient conditions 
that will guarantee this. In Section 4, we present a simple but illustrative exam- 
ple of how the iterative procedure to discover the missing values in incomplete 
fuzzy preference relations works. Finally, our concluding remarks and topics for 
possible future research are pointed out in Section 5. 
2 Preliminaries: Additive Consistency

Preference relations are one of the most common representation formats of in- 
formation used in decision-making problems because they are a useful tool in 
modelling decision processes, above all when we want to aggregate experts’ pref- 
erences into group preferences [3, 4, 5, 8]. In particular, fuzzy preference relations 
have been used in the development of many important decision-making proce- 
dures. 

Definition 1 [5, 7] A fuzzy preference relation P on a set of alternatives X is
a fuzzy set on the product set X × X, i.e., it is characterized by a membership
function

$$\mu_P : X \times X \to [0,1]$$

When the cardinality of X is small, the preference relation may be conveniently
represented by the n × n matrix $P = (p_{ij})$, with $p_{ij} = \mu_P(x_i, x_j)$ $\forall i,j \in \{1, \ldots, n\}$
interpreted as the preference degree or intensity of the alternative $x_i$
over $x_j$: $p_{ij} = 1/2$ indicates indifference between $x_i$ and $x_j$ ($x_i \sim x_j$), $p_{ij} = 1$
indicates that $x_i$ is absolutely preferred to $x_j$, and $p_{ij} > 1/2$ indicates that $x_i$ is
preferred to $x_j$ ($x_i \succ x_j$). Based on this interpretation we have $p_{ii} = 1/2$ $\forall i \in \{1, \ldots, n\}$
($x_i \sim x_i$).

The previous definition does not imply any kind of consistency. In fact,
preferences expressed in the fuzzy preference relation can be contradictory. As
studied in [4], to make a rational choice, a set of properties to be satisfied by such
fuzzy preference relations has been suggested. Transitivity is one of the most
important properties concerning preferences, and it represents the idea that the
preference value obtained by directly comparing two alternatives should be equal
to or greater than the preference value between those two alternatives obtained
using an indirect chain of alternatives [2]. One of these properties is the additive
transitivity [8]:

$$(p_{ij} - 0.5) + (p_{jk} - 0.5) = (p_{ik} - 0.5) \quad \forall i,j,k \in \{1, \ldots, n\} \qquad (1)$$

or equivalently:

$$p_{ij} + p_{jk} - 0.5 = p_{ik} \quad \forall i,j,k \in \{1, \ldots, n\} \qquad (2)$$

In this paper, we will consider a fuzzy preference relation to be “additive
consistent” when for every three options in the problem $x_i, x_j, x_k \in X$ their
associated preference degrees $p_{ij}$, $p_{jk}$, $p_{ik}$ fulfil Equation 2. An additive consistent
fuzzy preference relation will be referred to as consistent throughout this paper,
as this is the only transitivity property we are considering.
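
As a small illustration, the check in Equation 2 can be written directly as a triple loop over the alternatives. The sketch below is only an example: the function names, the tolerance `tol` and the score-based construction of a consistent relation ($p_{ij} = 0.5 + v_i - v_j$) are assumptions introduced here, not part of the original procedure.

```python
def is_additive_consistent(P, tol=1e-9):
    """True when p_ij + p_jk - 0.5 == p_ik (within tol) for every i, j, k."""
    n = len(P)
    return all(
        abs(P[i][j] + P[j][k] - 0.5 - P[i][k]) <= tol
        for i in range(n) for j in range(n) for k in range(n)
    )


def relation_from_scores(v):
    """Build p_ij = 0.5 + v_i - v_j, which satisfies Equation 2 by construction."""
    return [[0.5 + vi - vj for vj in v] for vi in v]
```

For instance, the scores (0.3, 0.2, 0.0) give $p_{12} = 0.6$, $p_{23} = 0.7$ and $p_{13} = 0.8$, and indeed 0.6 + 0.7 - 0.5 = 0.8; changing $p_{13}$ alone breaks the property.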
3 A Learning Procedure to Estimate Missing Values
in Fuzzy Preference Relations
Based on Additive Consistency

As we have already mentioned, missing information is a problem that we have 
to deal with because usual decision-making procedures assume that experts are 
able to provide preference degrees between any pair of possible alternatives. We 
note that a missing value in a fuzzy preference relation is not equivalent to a lack 
of preference of one alternative over another. In fact, a missing value may be 
the result of the incapacity of an expert to quantify the degree of preference 
of one alternative over another, and thus the expert decides not to give a pref- 
erence value to maintain the consistency of the values provided. In such cases, 
these missing values can be estimated from the existing information using, as a 
guidance criterion, the consistency degree of that information. 

To do this, in this section, we firstly give a definition of a consistency measure 
of a fuzzy preference relation based on the additive consistency property. We will, 
then, design the learning procedure to estimate missing values from existing ones. 
Finally, we will provide sufficient conditions that guarantee the success of the 
learning procedure in estimating all the missing values of an incomplete fuzzy 
preference relation. 



3.1 Consistency Measure 

Equation 2 can be used to calculate the value of a preference degree $p_{ik}$ using
other preference degrees in a fuzzy preference relation. In fact,

$$cp_{ik}^{j} = p_{ij} + p_{jk} - 0.5 \qquad (3)$$

where $cp_{ik}^{j}$ means the calculated value of $p_{ik}$ via j, that is, using $p_{ij}$ and $p_{jk}$.
Obviously, when the information provided in a fuzzy preference relation is
completely consistent then $cp_{ik}^{j}$, $\forall j \in \{1, \ldots, n\}$, and $p_{ik}$ coincide. However, the
information given by an expert does not usually fulfil Equation 2. In such cases,
the value

$$\varepsilon p_{ik} = \frac{\sum_{j=1;\, j \ne i,k}^{n} \left|cp_{ik}^{j} - p_{ik}\right|}{n - 2} \qquad (4)$$

can be used to measure the error expressed in a preference degree between two
options. This error can be interpreted as the consistency level between the
preference degree $p_{ik}$ and the rest of the preference values of the fuzzy preference
relation. Clearly, when $\varepsilon p_{ik} = 0$ then there is no inconsistency at all, and the
higher the value of $\varepsilon p_{ik}$ the more inconsistent $p_{ik}$ is with respect to the rest of
the information.
The consistency level for the whole fuzzy preference relation P is defined as
follows:

$$CL_P = \frac{\sum_{i,k=1;\, i \ne k}^{n} \varepsilon p_{ik}}{n^2 - n} \qquad (5)$$

When $CL_P = 0$ the preference relation P is fully (additive) consistent; otherwise,
the higher $CL_P$, the more inconsistent P is.
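
The two measures can be computed directly for a complete relation with at least three alternatives. The following is a minimal sketch under that assumption; the function names `pair_error` and `consistency_level` are hypothetical, and indices are 0-based in the code.

```python
def pair_error(P, i, k):
    """Average absolute gap between p_ik and its estimates via every other j."""
    n = len(P)
    js = [j for j in range(n) if j != i and j != k]
    return sum(abs(P[i][j] + P[j][k] - 0.5 - P[i][k]) for j in js) / len(js)


def consistency_level(P):
    """Average of pair_error over all ordered pairs (i, k) with i != k."""
    n = len(P)
    errors = [pair_error(P, i, k) for i in range(n) for k in range(n) if i != k]
    return sum(errors) / (n * n - n)
```

A relation built as $p_{ij} = 0.5 + v_i - v_j$ yields a level of zero, while perturbing a single entry raises the errors of every pair whose estimate passes through that entry.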

3.2 A Proposal for Learning Missing Values 

In the following definitions we express the concept of an incomplete fuzzy pref- 
erence relation: 

Definition 2 A function $f : X \to Y$ is partial when not every element in the
set X necessarily maps to an element in the set Y . When every element from 
the set X maps to one element of the set Y then we have a total function. 

Definition 3 An incomplete fuzzy preference relation P on a set of alterna- 
tives X is a fuzzy set on the product set X x X characterized by a partial mem- 
bership function. 

As per this definition, we call a fuzzy preference relation complete when its 
membership function is a total one. Clearly, the usual definition of a fuzzy pref- 
erence relation (Section 2) includes both definitions of complete and incomplete 
fuzzy preference relations. However, as there is no risk of confusion between 
a complete and an incomplete fuzzy preference relation, in this paper we refer 
to the first type as simply fuzzy preference relations. 

In the case of an incomplete fuzzy preference relation there exists at least
one pair of alternatives $(x_i, x_j)$ for which $p_{ij}$ is not known. We will introduce and
use throughout this paper the letter x to represent these unknown preference
values, i.e. $p_{ij} = x$. We also introduce the following sets:

$$A = \{(i,j) \mid i,j \in \{1, \ldots, n\} \wedge i \ne j\} \qquad (6)$$

$$MV = \{(i,j) \mid p_{ij} = x,\; (i,j) \in A\} \qquad (7)$$

$$EV = A \setminus MV \qquad (8)$$

MV is the set of pairs of alternatives for which the preference degree of the first 
alternative over the second one is unknown or missing; EV is the set of pairs 
of alternatives for which the expert provides preference values. Note that we do 
not take into account the preference value of one alternative over itself, as this 
is always assumed to be equal to 0.5. 
In the case of working with an incomplete fuzzy preference relation, we note
that Equation 4 cannot be used. An obvious consequence of this is the need to
extend the above definition of $CL_P$ to include cases when the fuzzy preference
relation is incomplete. We do this as follows:

$$H_{ik} = \{j \mid (i,j), (j,k) \in EV\} \quad \forall i \ne k \qquad (9)$$

$$\varepsilon p_{ik} = \frac{\sum_{j \in H_{ik}} \left|cp_{ik}^{j} - p_{ik}\right|}{\#H_{ik}} \qquad (10)$$

$$CE_P = \{(i,k) \in EV \mid \exists j : (i,j), (j,k) \in EV\} \qquad (11)$$

$$CL_P = \frac{\sum_{(i,k) \in CE_P} \varepsilon p_{ik}}{\#CE_P} \qquad (12)$$

We call $CE_P$ the computable error set because it contains all the elements for
which we can compute every $\varepsilon p_{ik}$. Clearly, this redefinition of $CL_P$ is an
extension of Equation 5. Indeed, when a fuzzy preference relation is complete, both
$CE_P$ and A coincide and thus $\#CE_P = n^2 - n$.

To develop the iterative procedure to learn missing values, two different tasks 
have to be carried out: 



A) To establish the elements that can be discovered in each step of the proce- 
dure, and 

B) To produce the particular expression that will be used to find out a particular 
missing value. 



A) Elements to be Learnt in Step h 

The subset of the missing values MV that can be learnt in step h of our procedure
is denoted by $LMV_h$ (learnable missing values) and defined as follows:

$$LMV_h = \left\{ (i,k) \in MV \setminus \bigcup_{l=0}^{h-1} LMV_l \;\middle|\; \exists j : (i,j), (j,k) \in EV \cup \left( \bigcup_{l=0}^{h-1} LMV_l \right) \right\} \qquad (13)$$

with $LMV_0 = \emptyset$.

When $LMV_{maxiter} = \emptyset$ with maxiter > 0 the procedure will stop as there
will be no more missing values to learn. Furthermore, if $\bigcup_{l=0}^{maxiter} LMV_l = MV$
then all missing values are learnt and consequently the procedure was successful
in the completion of the fuzzy preference relation.
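
The sets of learnable missing values depend only on which positions are known, not on the values themselves, so they can be computed from the index pairs alone. The sketch below does this with Python sets; the function name `learnable_sets` and the 1-based pair representation are choices made for this illustration.

```python
def learnable_sets(n, known):
    """Return ([LMV_1, LMV_2, ...], pairs that can never be learnt).

    n: number of alternatives; known: set of 1-based (i, k) pairs, i != k,
    for which the expert provided a value (the set EV).
    """
    all_pairs = {(i, k) for i in range(1, n + 1) for k in range(1, n + 1) if i != k}
    available = set(known)          # EV plus everything learnt so far
    missing = all_pairs - available
    steps = []
    while True:
        step = {
            (i, k) for (i, k) in missing
            if any((i, j) in available and (j, k) in available
                   for j in range(1, n + 1) if j != i and j != k)
        }
        if not step:                # LMV_h is empty: the procedure stops
            break
        steps.append(step)
        available |= step
        missing -= step
    return steps, missing
```

When the second returned set is empty, the union of the learnt sets equals MV and the relation can be completed.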
B) Expression to Learn the Value $p_{ik}$

In order to learn a particular value $p_{ik}$ with $(i,k) \in LMV_h$, in iteration h, we
propose the application of the following three-step function:

function learn_p(i,k)
    1. $I_{ik} = \left\{ j \mid (i,j), (j,k) \in EV \cup \left( \bigcup_{l=0}^{h-1} LMV_l \right) \right\}$
    2. Calculate $cp'_{ik} = \sum_{j \in I_{ik}} cp_{ik}^{j} \,/\, \#I_{ik}$
    3. Make $p_{ik} = cp'_{ik} + z$ with $z \in [-CL_P, CL_P]$ randomly selected,
       subject to $0 \le cp'_{ik} + z \le 1$
end function





With this procedure, a missing value $p_{ik}$ is estimated using Equation 3 when
there is at least one chained pair of known preference values $p_{ij}$, $p_{jk}$ that allow
this. If there is more than one pair of preference values that allow the estimation
of $p_{ik}$ using Equation 3 then we use their average value as an estimate of the
missing value. Finally, we add a random value $z \in [-CL_P, CL_P]$ to this
estimate in order to maintain the consistency level of the expert, but obviously
forcing the estimated value to be in the range of the fuzzy preference values
[0, 1].

The iterative learning procedure pseudo-code is as follows:

LMV_0 = ∅
h = 1
while LMV_h ≠ ∅ {
    for every (i, k) ∈ LMV_h {
        learn_p(i,k)
    }
    h++
}



We consider this procedure to be successful when all missing values have been 
estimated. However, as we have previously mentioned, there are cases when not 
every missing value of an incomplete fuzzy preference relation can be learnt. In 
the following, we provide an example illustrating this situation. 
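
Before turning to that example, the whole procedure can be sketched in Python as follows. This is an illustrative reading of the pseudo-code, with several assumptions stated plainly: the relation is stored as a dictionary keyed by 1-based pairs, the consistency level is computed once from the expert's values only, and the range constraint is enforced by clamping the perturbed estimate to [0, 1] rather than by re-drawing z. The function names are hypothetical.

```python
import random


def consistency_level(P, known):
    """Consistency level of an incomplete relation, from the known pairs only."""
    items = sorted({i for pair in known for i in pair})
    errors = []
    for (i, k) in known:
        via = [j for j in items if (i, j) in known and (j, k) in known]
        if via:  # (i, k) belongs to the computable error set
            err = sum(abs(P[i, j] + P[j, k] - 0.5 - P[i, k]) for j in via)
            errors.append(err / len(via))
    return sum(errors) / len(errors) if errors else 0.0


def complete(P, n, rng=random):
    """Return (completed copy of P, set of pairs that could not be learnt).

    P: dict {(i, k): value} with the expert's values, 1-based, i != k.
    """
    P = dict(P)
    cl = consistency_level(P, set(P))
    missing = {(i, k) for i in range(1, n + 1) for k in range(1, n + 1)
               if i != k and (i, k) not in P}
    while True:
        learnt = {}
        for (i, k) in missing:
            via = [j for j in range(1, n + 1) if (i, j) in P and (j, k) in P]
            if via:
                estimate = sum(P[i, j] + P[j, k] - 0.5 for j in via) / len(via)
                z = rng.uniform(-cl, cl)
                # Clamping is a simplification of "subject to 0 <= value <= 1".
                learnt[i, k] = min(1.0, max(0.0, estimate + z))
        if not learnt:
            return P, missing
        P.update(learnt)  # values learnt in step h become usable in step h + 1
        missing -= set(learnt)
```

When the expert's values are fully consistent, the consistency level is zero, no random perturbation is applied, and the learnt values reproduce the consistent completion exactly.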
3.3 Some Missing Values cannot be Learnt
by the Iterative Procedure

In this section we provide sufficient conditions to assure the learning of all missing 
values in the incomplete fuzzy preference relation; an example where not all 
missing values can be learned; and a brief discussion on the role of the additive 
reciprocity property in the learning process of missing values. 



A) Sufficient Conditions for Learning All Missing Values 

As we will see later, there are cases where not all missing information can be
estimated using our learning procedure. It is therefore of great importance to obtain
conditions that guarantee that all the missing information in an incomplete fuzzy
preference relation can be estimated. In the following, we provide sufficient
conditions that guarantee the success of the above learning procedure.

It is clear that if a value j exists so that for all $i, k \in \{1, 2, \ldots, n\}$ both (i, j)
and (j, k) do not belong to MV, then all the missing information can be learnt in
the first iteration of our procedure ($LMV_1 = MV$) because for every $(i,k) \in MV$
we can use at least the pair of preference values $p_{ij}$ and $p_{jk}$ to estimate it.

In [4], a different sufficient condition that guarantees the learning of all missing
values was given. This condition states that any incomplete fuzzy preference
relation can be converted into a complete one when the set of n − 1 values
$\{p_{12}, p_{23}, \ldots, p_{n-1\,n}\}$ is known. Another condition, more general than the previous
one, is when a set of n − 1 non-leading diagonal preference values, where each
one of the alternatives is compared at least once, is known. This general case
includes the one in which a complete row or column of preference values is known.
However, in these cases the additive reciprocity property is also assumed.



B) Impossibility of Learning All the Missing Values 

The following is an illustrative example of an incomplete fuzzy preference relation 
where our procedure is unable to learn all the missing values. 

Suppose an expert provides the following incomplete fuzzy preference relation 



$$P = \begin{pmatrix} - & e & e & x & x \\ e & - & x & e & x \\ x & x & - & x & x \\ e & x & x & - & e \\ x & x & e & e & - \end{pmatrix}$$

over a set of five different alternatives, $X = \{x_1, x_2, x_3, x_4, x_5\}$, where x means
“a missing value” and e means “a value is known”.



Remark 1. We note that the actual values of the known preference values are 
not relevant for the purpose of this example. 
At the beginning of our iterative procedure we obtain:

$$LMV_1 = \{(1,4), (2,3), (2,5), (4,2), (4,3), (5,1)\}$$



as we can find pairs of preference values that allow us to calculate the missing 
preference values in these positions. Indeed, the following table shows all the 
pairs of alternatives that are available to calculate each one of the above missing 
values: 



Missing value (i, k)    Pairs of values used to learn $p_{ik}$
(1,4)                   (1,2), (2,4)
(2,3)                   (2,1), (1,3)
(2,5)                   (2,4), (4,5)
(4,2)                   (4,1), (1,2)
(4,3)                   (4,1), (1,3); (4,5), (5,3)
(5,1)                   (5,4), (4,1)



The other missing values cannot be learnt in this first iteration of the procedure.
If we substitute all the x values learnt in this iteration by the number 1
(indicating the step in which they have been learnt) we obtain:

$$P = \begin{pmatrix} - & e & e & 1 & x \\ e & - & 1 & e & 1 \\ x & x & - & x & x \\ e & 1 & 1 & - & e \\ 1 & x & e & e & - \end{pmatrix}$$



In the next iteration, in order to construct the set $LMV_2$ we can use the
values expressed directly by the expert as well as the values learnt in iteration
1. In our case we have $LMV_2 = \{(1,5), (5,2)\}$:



Missing value (i, k) | Pairs of values used to learn p_ik
---------------------+-----------------------------------
(1,5)                | (1,2), (2,5); (1,4), (4,5)
(5,2)                | (5,1), (1,2); (5,4), (4,2)



and the incomplete fuzzy preference relation at this point is: 



\[
P = \begin{pmatrix}
- & e & e & 1 & 2 \\
e & - & 1 & e & 1 \\
x & x & - & x & x \\
e & 1 & 1 & - & e \\
1 & 2 & e & e & -
\end{pmatrix}
\]



In the next iteration $LMV_3 = \emptyset$. The procedure ends and it does not succeed
in the completion of the fuzzy preference relation. The reason for this failure
is that the expert did not provide any preference degree of the alternative $x_3$
over the rest of the alternatives, so the third row of $P$ is empty. Fortunately,
this kind of situation is not very common in real-life problems, and therefore the
procedure will usually be successful in finding out all the missing values. Clearly,
if additive reciprocity is also assumed (this is a direct consequence of the additive
transitivity property) then the chances of succeeding in estimating all the missing
values would increase, as we show next.
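
The bookkeeping in this example can be reproduced mechanically. The following Python sketch tracks only which positions are known, not their values. It assumes, as the tables above suggest, that $p_{ik}$ can be learnt whenever some intermediate alternative $j$ has both $p_{ij}$ and $p_{jk}$ known. The function name `learnable_sets` is ours and purely illustrative.

```python
def learnable_sets(n, known):
    """Return the successive sets LMV_1, LMV_2, ... and the positions left missing.

    n     -- number of alternatives (indices 1..n)
    known -- set of (i, k) pairs whose preference value the expert gave
    Assumption: (i, k) is learnable if some j (j != i, j != k) has both
    (i, j) and (j, k) already known.
    """
    known = set(known)
    missing = {(i, k) for i in range(1, n + 1) for k in range(1, n + 1)
               if i != k and (i, k) not in known}
    steps = []
    while True:
        lmv = {(i, k) for (i, k) in missing
               if any((i, j) in known and (j, k) in known
                      for j in range(1, n + 1) if j not in (i, k))}
        if not lmv:
            return steps, missing
        steps.append(lmv)
        known |= lmv
        missing -= lmv


# Known positions (the "e" entries) of the 5 x 5 example above.
EXAMPLE_KNOWN = {(1, 2), (1, 3), (2, 1), (2, 4), (4, 1), (4, 5), (5, 3), (5, 4)}
steps, still_missing = learnable_sets(5, EXAMPLE_KNOWN)
```

For the example, `steps` holds the two sets $LMV_1$ and $LMV_2$ listed above, and `still_missing` is exactly the third row, i.e. every $p_{3k}$.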

C) Additive Reciprocity Property 

In most studies, preference relations are usually assumed to be reciprocal. In 
particular, additive reciprocity is used in many decision models as one of the 
properties that fuzzy preference relations have to verify [1, 5]. Additive reci- 
procity is defined as: 



\[ p_{ij} + p_{ji} = 1 \quad \forall i, j \in \{1, 2, \ldots, n\} \tag{14} \]

Our iterative procedure does not imply any kind of reciprocity. In fact, it 
permits missing values in fuzzy preference relations to be estimated when this 
condition is not satisfied (as we show in Section 4). Furthermore, the procedure 
itself does not assure that the learnt values will fulfil the reciprocity property. 

However, if we assume that the fuzzy preference relation has to be reciprocal,
then some of the missing values that could not be estimated without this assumption
become estimable. In the previous example, all the $p_{3k}$ values that could not be
estimated could have been easily learnt assuming the additive reciprocity property.

In what follows, we describe how to use additive reciprocity in our procedure,
and the changes we need to make to ensure that the estimated values fulfil this
property.

Firstly, we need to guarantee that the incomplete fuzzy preference relation
given by the expert fulfils the reciprocity property, i.e. $p_{ij} + p_{ji} = 1$
$\forall (i,j), (j,i) \in EV$. This means that the first step of our procedure has
to be the computation of those missing values with a known reciprocal one, i.e. of
$p_{ij}$ for

\[ \forall (i,j) \in MV \wedge (j,i) \in EV. \tag{15} \]

The following steps of our procedure will be as described above but restricted 
to the learning of missing values above the leading diagonal of the incomplete 
fuzzy preference relation, i.e. pij with i < j. The last step of each iteration will 
consist in the computation of the corresponding missing values pji below the 
leading diagonal again using the reciprocity property. 
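
As a rough illustration of the modified procedure, the sketch below again tracks only which positions are known. It first mirrors every known value across the diagonal, then repeatedly learns positions above the diagonal and mirrors them. The function name and the set-based representation are our own assumptions, not part of the procedure as published.

```python
def learn_with_reciprocity(n, known):
    """Track which positions become known when additive reciprocity is assumed.

    Step 0 fills every (i, j) whose reciprocal (j, i) is known; each later
    iteration learns positions above the diagonal (i < k) and then mirrors them.
    Returns the set of positions that remain missing.
    """
    def mirror(pairs):
        return pairs | {(k, i) for (i, k) in pairs}

    known = mirror(set(known))
    while True:
        lmv = {(i, k)
               for i in range(1, n + 1) for k in range(i + 1, n + 1)
               if (i, k) not in known
               and any((i, j) in known and (j, k) in known
                       for j in range(1, n + 1) if j not in (i, k))}
        if not lmv:
            break
        known = mirror(known | lmv)
    return {(i, k) for i in range(1, n + 1) for k in range(1, n + 1)
            if i != k and (i, k) not in known}


# Known positions of the 5 x 5 example in the previous subsection.
known_example = {(1, 2), (1, 3), (2, 1), (2, 4), (4, 1), (4, 5), (5, 3), (5, 4)}
remaining = learn_with_reciprocity(5, known_example)
```

On the earlier 5 x 5 example, `remaining` is empty: with reciprocity, the third row becomes learnable as well.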

4 Illustrative Example 

In this section we use a simple but illustrative example to show the iterative 
procedure for learning missing values in incomplete fuzzy preference relations. 

Let us suppose that an expert provides the following incomplete fuzzy
preference relation

\[
P = \begin{pmatrix}
- & x & 0.4 & x \\
x & - & 0.7 & 0.85 \\
x & 0.4 & - & 0.75 \\
0.3 & x & x & -
\end{pmatrix}
\]
The first thing to do is to calculate the consistency level of $P$, $CL_P$. To do
this, we start by calculating all possible $\varepsilon p_{ik}$. In this case, we can
only calculate $\varepsilon p_{24}$ and $\varepsilon p_{34}$, as in the rest of the
cases either $p_{ik}$ is missing or there is no pair $p_{ij}$, $p_{jk}$ from which to
calculate the corresponding $cp_{ik}^{j}$.

\[ \varepsilon p_{24} = |p_{23} + p_{34} - 0.5 - p_{24}| = |0.7 + 0.75 - 0.5 - 0.85| = 0.1 \]
\[ \varepsilon p_{34} = |p_{32} + p_{24} - 0.5 - p_{34}| = |0.4 + 0.85 - 0.5 - 0.75| = 0 \]

These low values of $\varepsilon p_{24}$ and $\varepsilon p_{34}$ mean that the
inconsistency between $p_{24}$ and the rest of the given information is low, while
$p_{34}$ is totally consistent with the rest of the given information.

The next step consists in calculating $CL_P$ as the average of all the
$\varepsilon p_{ik}$ values:

\[ CL_P = \frac{\varepsilon p_{24} + \varepsilon p_{34}}{2} = \frac{0.1 + 0}{2} = 0.05 \]



At this point, we apply our iterative procedure: 



$LMV_1 = \{(1,2), (1,4), (2,1), (3,1), (4,3)\}$

For each element $(i,k) \in LMV_1$ we calculate $cp'_{ik}$. For example, $cp'_{12}$ is
obtained as:

\[ cp'_{12} = \frac{p_{13} + p_{32} - 0.5}{1} = 0.4 + 0.4 - 0.5 = 0.3 \]

Using the same procedure we obtain:

\[ cp'_{14} = 0.65; \quad cp'_{21} = 0.65; \quad cp'_{31} = 0.55; \quad cp'_{43} = 0.2 \]

Next, we proceed to add to each one of the above values a random value
$z \in [-0.05, 0.05]$ in order to maintain the expert's level of consistency. As a result
of this, we obtain the following incomplete fuzzy preference relation:

\[
\begin{pmatrix}
- & 0.32 & 0.4 & 0.61 \\
0.68 & - & 0.7 & 0.85 \\
0.5 & 0.4 & - & 0.75 \\
0.3 & x & 0.24 & -
\end{pmatrix}
\]

In the second iteration of our procedure we have $LMV_2 = \{(4,2)\}$,

\[ cp'_{42} = \frac{(p_{41} + p_{12} - 0.5) + (p_{43} + p_{32} - 0.5)}{2} = 0.13 \]

and $p_{42} = 0.13 + z$ with $z \in [-0.05, 0.05]$ chosen randomly, which gives us:

\[
\begin{pmatrix}
- & 0.32 & 0.4 & 0.61 \\
0.68 & - & 0.7 & 0.85 \\
0.5 & 0.4 & - & 0.75 \\
0.3 & 0.17 & 0.24 & -
\end{pmatrix}
\]

Obviously, $LMV_3 = \emptyset$, which means that our procedure was successful in
the process of discovering all the missing values of the original incomplete fuzzy
preference relation $P$.
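
The deterministic part of this example can be checked with a short Python sketch. It assumes, based on the two computations of $\varepsilon p_{24}$ and $\varepsilon p_{34}$ above, that the error of a known value is the absolute difference between that value and the average of $p_{ij} + p_{jk} - 0.5$ over all usable intermediate alternatives $j$. The random perturbation $z$ is left out, and the helper names are ours.

```python
def estimate(P, i, k):
    """Average of p_ij + p_jk - 0.5 over every intermediate j with both values known.

    P is a list of lists with None for missing (and diagonal) entries;
    indices are 0-based. Returns None when no intermediate j is available.
    """
    n = len(P)
    terms = [P[i][j] + P[j][k] - 0.5
             for j in range(n)
             if j not in (i, k) and P[i][j] is not None and P[j][k] is not None]
    return sum(terms) / len(terms) if terms else None


def consistency_level(P):
    """Mean of |estimate - p_ik| over the known entries that can be estimated."""
    n = len(P)
    errors = []
    for i in range(n):
        for k in range(n):
            if i != k and P[i][k] is not None:
                cp = estimate(P, i, k)
                if cp is not None:
                    errors.append(abs(cp - P[i][k]))
    return sum(errors) / len(errors)


x = None  # missing value
P = [[x, x, 0.4, x],
     [x, x, 0.7, 0.85],
     [x, 0.4, x, 0.75],
     [0.3, x, x, x]]

cl = consistency_level(P)
# Estimates for the missing positions that are learnable in the first iteration,
# keyed by 1-based (i, k) as in the text.
first_pass = {(i + 1, k + 1): estimate(P, i, k)
              for i in range(4) for k in range(4)
              if i != k and P[i][k] is None and estimate(P, i, k) is not None}
```

Up to floating-point rounding, `cl` reproduces the consistency level 0.05 and `first_pass` reproduces the five estimates for the positions in $LMV_1$; position (4,2) is absent because it only becomes learnable in the second iteration.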



5 Concluding Remarks and Future Research 

In this paper we have discussed the importance of consistency in decision-making 
problems, and we have presented a common issue that must be addressed when 
attempting to solve this kind of problem: incompleteness of information. 

In particular, we have focused our attention on incomplete fuzzy preference 
relations and the issue of finding out their missing values. To do this, we have 
presented a new iterative procedure to learn missing values which is guided by 
the additive consistency level of the information known. 

In future research, a new induced OWA (IOWA) operator will be developed 
to aggregate information giving more importance to those experts whose fuzzy 
preference relations are most consistent. Finally, a general decision procedure, 
implementing both the learning procedure and the new IOWA operator, will be 
developed to solve group decision-making problems with incomplete information 
and inconsistency in the sources of information. 



References 

[1] Chiclana, F., Herrera, F., Herrera-Viedma, E.: Integrating three representation
models in fuzzy multipurpose decision making based on fuzzy preference relations.
Fuzzy Sets and Systems 97 (1998) 33-48

[2] Dubois, D., Prade, H.: Fuzzy Sets and Systems: Theory and Applications. Academic
Press, New York (1980)

[3] Fodor, J., Roubens, M.: Fuzzy preference modelling and multicriteria decision
support. Kluwer, Dordrecht (1994)

[4] Herrera-Viedma, E., Herrera, F., Chiclana, F., Luque, M.: Some issues on
consistency of fuzzy preference relations. European Journal of Operational Research
154 (2004) 98-109

[5] Kacprzyk, J.: Group decision making with a fuzzy linguistic majority. Fuzzy Sets
and Systems 18 (1986) 105-118

[6] Kim, S. H., Choi, S. H., Kim, J. K.: An interactive procedure for multiple attribute
group decision making with incomplete information: Range-based approach.
European Journal of Operational Research 118 (1999) 139-152

[7] Orlovski, S. A.: Decision-making with fuzzy preference relations. Fuzzy Sets and
Systems 1 (1978) 155-167

[8] Tanino, T.: Fuzzy preference orderings in group decision making. Fuzzy Sets and
Systems 12 (1984) 117-131

[9] Xu, Z. S.: Goal programming models for obtaining the priority vector of
incomplete fuzzy preference relation. International Journal of Approximate Reasoning
(2004) to appear



Decision Making in a Dynamic System 
Based on Aggregated Fuzzy Preferences 



Yuji Yoshida 

Faculty of Economics and Business Administration, University of Kitakyushu 
4-2-1 Kitagata, Kokuraminami, Kitakyushu 802-8577, Japan 
yoshida@kitakyu-u.ac.jp



Abstract. Fuzzy preferences are related to decision making in artificial
intelligence. A mathematical model for dynamic and stochastic decision
making together with perception and cognition is presented. This paper
models human behavior based on aggregated fuzzy preferences, and
an objective function induced from the fuzzy preferences is formulated.
In dynamic decision making, it is difficult to formulate the objective
function from fuzzy preferences, since the value criterion of fuzzy
preferences in dynamic behavior changes with time and is formulated
gradually from experience. A reasonable criterion based on fuzzy
preferences is formulated for dynamic decision making, and an optimality
equation for this model is derived by dynamic programming. Mathematical
models that simulate human behavior and decision making are applicable to
various fields: robotics, customers' behavior analysis in marketing,
multi-agent systems and so on.



1 Introduction 

Fuzzy preferences in decision making model human behavior, and they are
related to decision making in artificial intelligence. This paper presents a dynamic
decision making model with fuzzy preferences, designed for a system that
cognizes the states it encounters in given environments and makes
decisions based on its own reasoning ([11]). We discuss a reasonable criterion
based on aggregated fuzzy preferences in dynamic decision making.

Objective functions in decision making are usually given as invariant value
criteria, for example utility functions in economics and management science and
Lyapunov functions induced from distances in control theory ([2, 9, 15]). This
paper deals with the decision maker's personal fuzzy preferences in dynamic
behavior instead of these objective functions ([3]). When we deal with fuzzy
preferences in dynamic decision making, we face a difficulty that does not arise in
the static case: the value criterion of fuzzy preferences in dynamic behavior
changes with time, and it is formulated from the experience obtained step by
step. In this paper, we introduce a dynamic decision making model
together with perception and cognition, and we formulate a reasonable criterion
based on aggregated fuzzy preferences. In management science, we recently find
a lot of multi-attribute/multi-objective problems, which are based on scenario
description with computer simulation. To deal with problems of this type,
optimization with fuzzy preferences, described in the form of pair-wise
comparison of objects, is an effective approach. We try to deal with dynamic
decision making with preferences from the viewpoint of artificial intelligence
([1, 10, 13, 14]). By dynamic programming ([7, 16]), we also discuss an optimality
equation for the model in a situation where the decision maker is accustomed to his
environment. Mathematical models simulating human behavior arising from decision
making are needed in various fields: robotics, customers' behavior analysis in
marketing, multi-agent systems and so on ([5]). Our dynamic decision making model
in this paper is designed under the concept in Figure 1.

Fig. 1. Decision making with preferences (panel labels: Perception, Decision Making)

2 Preference and Ranking 

In this section, we introduce basic properties of fuzzy relations and fuzzy prefer- 
ences, and we discuss a ranking method based on them. Finally, we consider an 
extension of fuzzy preferences and the ranking method for a dynamic decision 
making model. 

2.1 Fuzzy States 

In this model, the states are represented by fuzzy sets because it is generally
difficult for the decision maker to know complete information about the
cognized states/objects which he is confronted with. Let $C$ be a sigma-compact
convex subset of some Banach space. The attributes of the states/objects can
be represented as $d$-dimensional coordinates when the Banach space is taken
to be the $d$-dimensional Euclidean space $\mathbb{R}^d$. States are given by fuzzy
sets on $C$. Fuzzy sets on $C$ are represented by their membership functions
$\tilde{a} : C \to [0, 1]$ which are upper-semicontinuous and satisfy the normality
condition $\max_{x \in C} \tilde{a}(x) = 1$ ([17]). $\mathcal{F}(C)$ denotes the family
of all fuzzy sets $\tilde{a}$ on $C$. The fuzziness is caused by the decision maker's
lack of knowledge about current states, and it models the limitations of his
cognitive faculty. By introducing fuzziness into the representation of states, we
can model the vagueness regarding future states and current states. We consider
two kinds of states: perceived states and cognized states. A perceived state
represents a state outside the system and a cognized state represents a state
inside the system. In this section, we deal with cognized states.

2.2 Fuzzy Preference Relations 

Let $S$ be a finite subset of $\mathcal{F}(C)$. A map $\mu : S \times S \to [0, 1]$
is called a fuzzy relation on $S$. Fuzzy preferences are defined by fuzzy relations
on $S$ ([3, 6]): A fuzzy relation $\mu$ on $S$ is called a fuzzy preference relation if it
satisfies the following conditions (a) - (c):

(a) $\mu(\tilde{a}, \tilde{a}) = 1$ for all $\tilde{a} \in S$.

(b) $\mu(\tilde{a}, \tilde{c}) \ge \min\{\mu(\tilde{a}, \tilde{b}), \mu(\tilde{b}, \tilde{c})\}$ for all $\tilde{a}, \tilde{b}, \tilde{c} \in S$.

(c) $\mu(\tilde{a}, \tilde{b}) + \mu(\tilde{b}, \tilde{a}) \ge 1$ for all $\tilde{a}, \tilde{b} \in S$.

Here, $\mu(\tilde{a}, \tilde{b})$ means the degree to which the decision maker prefers $\tilde{a}$ to $\tilde{b}$.

2.3 Score Ranking Functions 

We introduce a ranking method of states from the viewpoint of fuzzy preference,
which is called a score ranking function ([3]). For a fuzzy preference relation $\mu$ on
$S$, the following map $r$ on $S$ is called a score ranking function of states induced
by the fuzzy preference relation $\mu$:

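\[ r(\tilde{a}) = \sum_{\tilde{b} \in S : \tilde{b} \ne \tilde{a}} \{\mu(\tilde{a}, \tilde{b}) - \mu(\tilde{b}, \tilde{a})\} \tag{1} \]

for $\tilde{a} \in S$.

First we consider a case where $S$ has a linear order $\succeq$. Define a relation $\mu$ on
$S$ by

\[ \mu(\tilde{a}, \tilde{b}) := \begin{cases} 1 & \text{if } \tilde{a} \succeq \tilde{b} \\ 0 & \text{otherwise.} \end{cases} \tag{2} \]

Then, $\mu$ is a fuzzy preference relation, and for $\tilde{a}, \tilde{b} \in S$, it holds that
$\tilde{a} \succeq \tilde{b} \iff r(\tilde{a}) \ge r(\tilde{b})$.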

Next we consider a subset $\mathcal{C} := \{\tilde{c}^i \mid i = 1, 2, \ldots, n\}$ of
$\mathcal{F}(C)$ such that $\mathcal{C}$ has $n$ elements and a linear order $\succeq$,
where $n \ge 2$. Let $\mu$ be a fuzzy relation on $\mathcal{C}$ and
let $r$ be the score ranking function induced by $\mu$:

\[ r(\tilde{c}^i) = \sum_{\tilde{b} \in \mathcal{C}} \{\mu(\tilde{c}^i, \tilde{b}) - \mu(\tilde{b}, \tilde{c}^i)\} = \sum_{j=1}^{n} r^{ij}, \tag{3} \]

where $r^{ij} := \mu(\tilde{c}^i, \tilde{c}^j) - \mu(\tilde{c}^j, \tilde{c}^i)$ $(i, j = 1, 2, \ldots, n)$.
Here, the score ranking function $r$ takes values in $[-n + 1, n - 1]$. By using the
ranking method $r$, we can consistently extend the ranking on $\mathcal{C}$ to one on
$\mathcal{C}'$ which has finite elements and satisfies
$\mathcal{C} \subset \mathcal{C}' \subset \mathcal{F}(C)$. In the next section, we introduce
a dynamic model where the number of cognizable states increases with time. Then, we
need a scaling of the score ranking function $r$ to normalize its value region, which
expands with time and the number of elements in $\mathcal{C}$. Since $\mathcal{C}$ has $n$
elements, we introduce a scaling translation $f_n : [-n + 1, n - 1] \to [0, 1]$ by

\[ f_n(x) := \frac{x}{2(n - 1)} + \frac{1}{2} \tag{4} \]

for $x \in [-n + 1, n - 1]$.
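
As a small illustration of (1), (2) and (4), the following Python sketch computes the score ranking function for a linearly ordered set and rescales the scores to $[0, 1]$. The states are represented by plain integers, which is our simplification; in the model they are fuzzy sets.

```python
def score_ranking(mu, states):
    """r(a) = sum over b != a of mu(a, b) - mu(b, a), as in (1)."""
    return {a: sum(mu(a, b) - mu(b, a) for b in states if b != a)
            for a in states}


def scale(x, n):
    """Scaling (4): maps [-(n - 1), n - 1] onto [0, 1]."""
    return x / (2 * (n - 1)) + 0.5


# Hypothetical linearly ordered states: a larger number is preferred.
states = [1, 2, 3, 4]


def mu(a, b):
    """Relation (2): 1 if a is at least as good as b, 0 otherwise."""
    return 1 if a >= b else 0


scores = score_ranking(mu, states)
scaled = {a: scale(r, len(states)) for a, r in scores.items()}
```

The scores run from $-(n-1)$ for the least preferred state to $n-1$ for the most preferred one, so the ordering of the scores agrees with the linear order, and the scaled values lie in $[0, 1]$.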

2.4 An Extension of Fuzzy Relations and Score Ranking Functions 

Let us consider a fuzzy relation and a score ranking function on an extended
state space $\mathcal{C}'$. Let $\mathcal{A}$ denote the family of fuzzy sets $\tilde{a}$
which are represented as

\[ \tilde{a} = \sum_{i=1}^{n} w^i \tilde{c}^i \tag{5} \]

with some weight vector $(w^1, w^2, \ldots, w^n)$ satisfying $\sum_{i=1}^{n} w^i = 1$. In this
paper, the system itself makes decisions on the basis of weighted aggregation
of preferences. By extending the notion of aggregation defined on the closed
interval $[0, 1]$ ([13]), we introduce a weighted aggregation for extended values in
$\mathbb{R}$ to discuss a decision making model with perception and cognition. Let
$w \in \mathbb{R}^n$ be a weight vector $w = (w^1, w^2, \ldots, w^n)$ such that
$\sum_{i=1}^{n} w^i = 1$. A function $\xi(\cdot\,; w) : \mathbb{R}^n \to \mathbb{R}$ is called
an extended weighted aggregation if it satisfies the following conditions (a) - (c):

(a) $\xi(0, 0, \ldots, 0; w) = 0$ and $\xi(1, 1, \ldots, 1; w) = 1$.

(b) Let $i = 1, 2, \ldots, n$ with $w^i > 0$ and let $(a^1, a^2, \ldots, a^n) \in \mathbb{R}^n$.
Then the map $a^i \mapsto \xi(a^1, \ldots, a^i, \ldots, a^n; w)$ with respect to the $i$-th
element is nondecreasing.

(c) Let $i = 1, 2, \ldots, n$ with $w^i < 0$ and let $(a^1, a^2, \ldots, a^n) \in \mathbb{R}^n$.
Then the map $a^i \mapsto \xi(a^1, \ldots, a^i, \ldots, a^n; w)$ with respect to the $i$-th
element is nonincreasing.

Using an extended weighted aggregation $\xi$, we can define a fuzzy relation $\mu'$
on an extended set $\mathcal{C} \cup \{\tilde{a}\}$ as follows:

\[ \mu' = \mu \text{ on } \mathcal{C} \times \mathcal{C}, \qquad \mu'(\tilde{a}, \tilde{a}) = 1, \tag{6} \]

\[ \mu'(\tilde{a}, \tilde{b}) := \xi(\mu(\tilde{c}^1, \tilde{b}), \mu(\tilde{c}^2, \tilde{b}), \ldots, \mu(\tilde{c}^n, \tilde{b}); w), \tag{7} \]

\[ \mu'(\tilde{b}, \tilde{a}) := \xi(\mu(\tilde{b}, \tilde{c}^1), \mu(\tilde{b}, \tilde{c}^2), \ldots, \mu(\tilde{b}, \tilde{c}^n); w) \tag{8} \]

for $\tilde{b} \in \mathcal{C}$. If the weights take values in $[0, 1]$, then $\mathcal{A}$ is
the set of convex linear combinations of $\mathcal{C}$. In this paper, we consider a case
where we accept that the weights $w^i$ $(i = 1, 2, \ldots, n)$ take values not only in the
interval $[0, 1]$ but also outside the interval. This extension makes it possible to
cognize new objects outside past knowledge. Thus, it will be possible to learn new
objects in a much wider scope as time passes. We need to deal with fuzzy relations
taking values in real numbers outside $[0, 1]$. However, the scaling of the fuzzy
relations will be done totally for the score ranking function at each time when we
consider a criterion based on fuzzy preference in Section 4 from estimation results of
score ranking in Section 3. We also note that this extension is applicable even when
the order $\succeq$ is a partial order on $\mathcal{C}$.

Fig. 2. Extension of the cognizable scope (panels: conservative case, extensive case)

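In this paper, we adopt the following aggregation: Let $\gamma$ be a nonzero real
number. Define an aggregation

\[ \xi(a^1, a^2, \ldots, a^n; w) := g^{-1}\left( \sum_{i=1}^{n} w^i g(a^i) \right) \tag{9} \]

for $(a^1, a^2, \ldots, a^n) \in \mathbb{R}^n$ and a weight vector
$(w^1, w^2, \ldots, w^n) \in \mathbb{R}^n$ satisfying $\sum_{i=1}^{n} w^i = 1$, where
$g : \mathbb{R} \to \mathbb{R}$ is an increasing function and
$g^{-1} : \mathbb{R} \to \mathbb{R}$ is its inverse function given by

\[ g(a) := \begin{cases} \operatorname{sign}(a)\,|a|^{\gamma} & \text{if } a \ne 0 \\ 0 & \text{if } a = 0, \end{cases} \tag{10} \]

\[ g^{-1}(a) := \begin{cases} \operatorname{sign}(a)\,|a|^{1/\gamma} & \text{if } a \ne 0 \\ 0 & \text{if } a = 0. \end{cases} \tag{11} \]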
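
The aggregation in (9) - (11) can be sketched in a few lines of Python. The function names and the example weights are ours; the weights are assumed to sum to 1, and $\gamma$ is taken positive here so that $g$ is increasing.

```python
import math


def g(a, gamma):
    """g(a) = sign(a) * |a|**gamma, with g(0) = 0, as in (10)."""
    return math.copysign(abs(a) ** gamma, a) if a != 0 else 0.0


def g_inv(a, gamma):
    """Inverse of g, as in (11)."""
    return math.copysign(abs(a) ** (1.0 / gamma), a) if a != 0 else 0.0


def aggregate(values, weights, gamma):
    """Aggregation (9): g^{-1}(sum_i w^i * g(a^i)); weights are assumed to sum to 1."""
    return g_inv(sum(w * g(a, gamma) for a, w in zip(values, weights)), gamma)


# Hypothetical weights; they lie outside [0, 1], which the model permits.
weights = [1.5, -0.5]
all_zero = aggregate([0.0, 0.0], weights, gamma=2.0)
all_one = aggregate([1.0, 1.0], weights, gamma=2.0)
outside = aggregate([0.8, 0.2], weights, gamma=1.0)
```

The first two calls illustrate condition (a) of an extended weighted aggregation. In the third call, with $\gamma = 1$, the result is the plain weighted sum and exceeds 1, which shows why the extended fuzzy relations may take values outside $[0, 1]$.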



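Then, the fuzzy relation $\mu'$ on an extended set $\mathcal{C} \cup \{\tilde{a}\}$ is
represented as follows:

\[ \mu' = \mu \text{ on } \mathcal{C} \times \mathcal{C}, \qquad \mu'(\tilde{a}, \tilde{a}) = 1, \tag{12} \]

\[ \mu'(\tilde{a}, \tilde{b}) := g^{-1}\left( \sum_{i=1}^{n} w^i g(\mu(\tilde{c}^i, \tilde{b})) \right), \tag{13} \]

\[ \mu'(\tilde{b}, \tilde{a}) := g^{-1}\left( \sum_{i=1}^{n} w^i g(\mu(\tilde{b}, \tilde{c}^i)) \right) \tag{14} \]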



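for $\tilde{b} \in \mathcal{C}$. The corresponding extended score ranking function $r'$
for the state $\tilde{a}$ is

\[ r'(\tilde{a}) = \sum_{\tilde{b} \in \mathcal{C}} \{\mu'(\tilde{a}, \tilde{b}) - \mu'(\tilde{b}, \tilde{a})\} \tag{15} \]

\[ \phantom{r'(\tilde{a})} = \sum_{\tilde{b} \in \mathcal{C}} \left\{ g^{-1}\left( \sum_{i=1}^{n} w^i g(\mu(\tilde{c}^i, \tilde{b})) \right) - g^{-1}\left( \sum_{i=1}^{n} w^i g(\mu(\tilde{b}, \tilde{c}^i)) \right) \right\}. \tag{16} \]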



3 Dynamic Decision Making Model and Score Ranking 

In this section, we introduce a dynamic decision making model with fuzzy
preferences and a time space $\{0, 1, 2, \ldots, T\}$. Next, we estimate the score ranking
function to establish a scaling function. The estimation is needed to define an
objective function in the next section.



3.1 A Dynamic Decision Making Model 

Let $S_0$ be a subset of $\mathcal{F}(C)$ such that $S_0 := \{\tilde{c}^i \mid i = 1, 2, \ldots, n\}$
has $n$ elements and a partial order $\succeq$. $S_0$ is called an initial state space and it
is given as a training set in a learning model. Let $\mu_0$ be a fuzzy preference relation
on $S_0$ such that for $\tilde{a}, \tilde{b} \in S_0$

\[ \mu_0(\tilde{a}, \tilde{b}) := \begin{cases} 1 & \text{if } \tilde{a} \text{ and } \tilde{b} \text{ are comparable and } \tilde{a} \succeq \tilde{b} \\ 0 & \text{if } \tilde{a} \text{ and } \tilde{b} \text{ are comparable and } \tilde{a} \not\succeq \tilde{b} \\ \beta & \text{if } \tilde{a} \text{ and } \tilde{b} \text{ are incomparable} \end{cases} \tag{17} \]

with some $\beta \in [0, 1]$ given by the decision maker. When we deal with actual
data, if a fuzzy relation given by the decision maker does not satisfy the transitivity
condition (b) in the definition of fuzzy preferences, one reasonable method
is to apply its transitive closure ([3]). Let $t (= 0, 1, 2, \ldots, T)$ be a current time.
An action space $A_t$ at time $t (< T)$ is given by a compact set of some Banach
space. We deal with two kinds of states. One is perceived states outside the
system and the other is cognized states inside the system, since generally there
exists some difference between them. The cognized state is computed from the
perceived state by approximation on the basis of the initial states
and the past states. At time $t$, a current cognized state is denoted by $\tilde{s}_t$. An
initial state $\tilde{s}_0$ is given by an element in $S_0$. Define a family of states until
time $t$ by $S_t := S_0 \cup \{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_t\} = \{\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^n, \tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_t\}$
for $t = 1, 2, \ldots, T$. For $t = 0, 1, 2, \ldots, T$, $u_t (\in A_t)$ means an action at time $t$, and
$h_t = (\tilde{s}_0, u_0, \tilde{s}_1, u_1, \ldots, \tilde{s}_{t-1}, u_{t-1}, \tilde{s}_t)$ means a history with states
$\tilde{s}_0, \tilde{s}_1, \ldots, \tilde{s}_t$ and actions $u_0, u_1, \ldots, u_{t-1}$. Then, a strategy is a map
$\pi_t : \{h_t\} \to A_t$ which is represented as $\pi_t(h_t) = u_t$ for some $u_t \in A_t$. A sequence
$\pi = \{\pi_t\}_{t=0}^{T-1}$ of strategies is called a policy. Let $p$ be a nonnegative number.
We deal with the case where a current cognized state $\tilde{s}_t$ is represented by a linear
combination of the initial states $\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^n$ and the past states
$\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{t-1}$:



\[ \tilde{s}_t = \sum_{i=1}^{n} w_t^i \tilde{c}^i + \sum_{j=1}^{t-1} w_t^{n+j} \tilde{s}_j \tag{18} \]

for some weight vector $(w_t^1, w_t^2, \ldots, w_t^{n+t-1}) \in \mathbb{R}^{n+t-1}$ satisfying
$-p \le w_t^i \le 1 + p$ $(i = 1, 2, \ldots, n + t - 1)$ and $\sum_{i=1}^{n+t-1} w_t^i = 1$,
where we put

\[ w_0^i := \begin{cases} 1 & \text{if } \tilde{s}_0 = \tilde{c}^i \\ 0 & \text{if } \tilde{s}_0 \ne \tilde{c}^i \end{cases} \tag{19} \]

for $i = 1, 2, \ldots, n$. The equation (18) means that the current cognized state $\tilde{s}_t$
is understandable from the past states $S_{t-1} = \{\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^n, \tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{t-1}\}$,
which we call an experience set. Then, $p$ is called a capacity factor regarding the
range of cognizable states. The cognizable range of states becomes bigger as the
positive constant $p$ is taken greater in this model. The range is measured by $p$
through the interval $-p \le w_t^i \le 1 + p$. If $p = 0$, the system is conservative and
the cognizable range of states at any time $t = 1, 2, \ldots, T$ is the same as
the initial cognizable scope, which is the convex hull of $S_0 = \{\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^n\}$.



3.2 Perceived States and Cognized States 

Let a perceived state at time $t$ be $\tilde{o}_t (\in \mathcal{F}(C))$, which depends on the
action $u_{t-1}$ taken at the previous time $t - 1$, since the action $u_{t-1}$ affects the
surroundings and the state $\tilde{o}_t$ to be perceived at time $t$. To determine a cognized
state $\tilde{s}_t$ in the form (18) from observed data regarding the perceived state
$\tilde{o}_t$, we use fuzzy neural networks: First, we give input data from the perceived
state $\tilde{o}_t$ by $\{(x^1, \alpha^1), (x^2, \alpha^2), \ldots, (x^L, \alpha^L)\} \subset C \times [0, 1]$
such that $\tilde{o}_t(x^l) = \alpha^l$ for $l = 1, 2, \ldots, L$, and next we determine the weight
vector $(w_t^1, w_t^2, \ldots, w_t^{n+t-1})$ in (18) so as to minimize the following error
between the data and a cognizable value in (18):

^ j (x') j . (20) 

From the structure of the optimization problem, a fuzzy regression method using 
neural networks is applicable to (20)([4]). 
Fig. 3. Perceived states and cognized states



3.3 A Translation of Weights 

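Let $t (= 1, 2, \ldots, T)$ be a current time. By (18) we define a fuzzy relation $\mu_t$ on
$S_t$ by induction on $t$ as follows:

\[ \mu_t := \mu_{t-1} \text{ on } S_{t-1} \times S_{t-1}, \qquad \mu_t(\tilde{s}_t, \tilde{s}_t) := 1, \tag{21} \]

\[ \mu_t(\tilde{s}_t, \tilde{a}) := g^{-1}\left( \sum_{i=1}^{n} w_t^i g(\mu_t(\tilde{c}^i, \tilde{a})) + \sum_{j=1}^{t-1} w_t^{n+j} g(\mu_t(\tilde{s}_j, \tilde{a})) \right), \tag{22} \]

\[ \mu_t(\tilde{a}, \tilde{s}_t) := g^{-1}\left( \sum_{i=1}^{n} w_t^i g(\mu_t(\tilde{a}, \tilde{c}^i)) + \sum_{j=1}^{t-1} w_t^{n+j} g(\mu_t(\tilde{a}, \tilde{s}_j)) \right) \tag{23} \]

for $\tilde{a} \in S_{t-1}$.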

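To simplify the problem, we introduce a translation of weights. For $i =
1, 2, \ldots, n$, we define a sequence of weights $\{\bar{w}_t^i\}_{t=0}^{T}$ inductively by
$\bar{w}_0^i := w_0^i$ and

\[ \bar{w}_t^i := w_t^i + \sum_{m=1}^{t-1} w_t^{n+m} \bar{w}_m^i \tag{24} \]

$(t = 1, 2, \ldots, T)$. Then, we can easily check $\sum_{i=1}^{n} \bar{w}_t^i = 1$. The
computation rule for the extended fuzzy relations at time $t$ is given as follows: For a
current time $t (= 0, 1, 2, \ldots, T)$ and an initial state or a past state
$\tilde{a} (\in S_{t-1})$, it holds that

\[ \mu_t(\tilde{s}_t, \tilde{a}) = g^{-1}\left( \sum_{i=1}^{n} \bar{w}_t^i g(\mu_t(\tilde{c}^i, \tilde{a})) \right), \tag{25} \]

\[ \mu_t(\tilde{a}, \tilde{s}_t) = g^{-1}\left( \sum_{i=1}^{n} \bar{w}_t^i g(\mu_t(\tilde{a}, \tilde{c}^i)) \right). \tag{26} \]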
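
The idea behind the translation of weights is that a linear combination of initial and past states can be rewritten as a combination of the initial states alone. The Python sketch below carries out this substitution step by step. It assumes that the weights of (18) are stored per time step, with the first $n$ entries referring to the initial states and the remaining ones to the past states; the container layout and the function name are hypothetical.

```python
def translate_weights(n, w):
    """Express each cognized state through the n initial states only.

    w[t] (t = 1, 2, ...) holds the n + t - 1 weights of (18): the first n
    refer to the initial states, the rest to the past states s_1..s_{t-1}.
    Returns wbar with wbar[t][i] = w[t][i] + sum_m w[t][n + m - 1] * wbar[m][i].
    """
    wbar = {}
    for t in sorted(w):
        wbar[t] = [w[t][i] + sum(w[t][n + m - 1] * wbar[m][i] for m in range(1, t))
                   for i in range(n)]
    return wbar


# Hypothetical weights for n = 2 initial states and two time steps.
w = {1: [1.2, -0.2],
     2: [0.5, 0.3, 0.2]}
wbar = translate_weights(2, w)
```

In this toy case the translated weights at each time step again sum to 1, as claimed in the text.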
Fig. 4. Translated capacity $p_t$ for $p = 0.05$



3.4 Estimation of Translated Weights 

In this paper, we use the sequence of weights $\{\bar{w}_t^i\}_{t=0}^{T}$ in (24) rather
than the sequence of weights $\{w_t^i\}_{t=0}^{T}$ in (18). Then, the following equation
gives a computation rule regarding capacities: Define a sequence of capacities
$\{p_t\}_{t=1}^{T}$ by

\[ p_{t+1} = p_t + p(1 + t + t p_t) \tag{27} \]

for $t = 1, 2, \ldots, T$. Then, it holds that $-p_t \le \bar{w}_t^i \le 1 + p_t$ for
$i = 1, 2, \ldots, n$ and $t = 1, 2, \ldots, T$.

Figure 4 shows that the interval of weights at time $t = 10$ is about
$[-p_t, 1 + p_t] = [-8.13, 9.13]$, much wider than the initial $[0, 1]$, when we give a
capacity $p = 0.05$ in (27). This means that the cognizable range grows to about 17.26
times the initial range after 9 steps. The capacity term $p_t$ is an increasing function
of $t$, and the increase corresponds to the fact that the range of cognizable
states $S_t$ expands with time $t$.
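
The recursion (27) is easy to evaluate numerically. The Python sketch below assumes that the sequence starts from $p_1 = p$, which is not stated explicitly in the text but is consistent with the value quoted for $t = 10$.

```python
def capacities(p, horizon):
    """Return {t: p_t} for t = 1..horizon using p_{t+1} = p_t + p * (1 + t + t * p_t).

    Assumption: the recursion starts from p_1 = p (not stated explicitly above).
    """
    cap = {1: p}
    for t in range(1, horizon):
        cap[t + 1] = cap[t] + p * (1 + t + t * cap[t])
    return cap


cap = capacities(0.05, 10)
# Length of the weight interval [-p_t, 1 + p_t] at t = 10.
width = (1 + cap[10]) - (-cap[10])
```

With $p = 0.05$ this gives $p_{10} \approx 8.13$, so the weight interval at $t = 10$ is roughly $[-8.13, 9.13]$, in line with the figures quoted above.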

3.5 Representation of Scores and Its Estimation when $\gamma = 1$

When $\gamma = 1$, we obtain the following simple representation by weights regarding
the score $r_t(\tilde{s}_t)$: It holds that

\[ r_t(\tilde{s}_t) = \sum_{i=1}^{n} \sum_{j=1}^{n} \bar{w}_t^i r^{ij} + \sum_{m=1}^{t-1} \sum_{i=1}^{n} \sum_{j=1}^{n} \bar{w}_t^i \bar{w}_m^j r^{ij} \tag{28} \]

for $t = 1, 2, \ldots, T$, where $r^{ij}$ is given by
$r^{ij} := \mu_0(\tilde{c}^i, \tilde{c}^j) - \mu_0(\tilde{c}^j, \tilde{c}^i)$, $i, j = 1, 2, \ldots, n$.
An upper bound $K(n, t)$ for the value region of the score $r_t(\tilde{s}_t)$ when
$\gamma = 1$ is shown in the following figure.

Fig. 5. The upper bound $K(n, t)$ of $|r_t(\tilde{s}_t)|$ for $p = 0.01$, $n = 4$ and $\gamma = 1$

If we give 4 initial states and $p = 0.01$, the initial upper bound of $|r_0(\tilde{s}_0)|$
is $2n - 2 = 6$. However, at time 9, the upper bound $K(n, t)$ has increased to
87.5451.



4 Dynamic Decision Making 

with Fuzzy Preferences in Stochastic Environments

Now we consider a stochastic decision process on the results of the previous 
sections. 



4.1 A Dynamic Decision Process in Stochastic Environments 

Let $(\Omega, P)$ be a probability space. Let $\pi$ be a policy and let
$t (= 0, 1, 2, \ldots, T)$ be a current time. In this paper, we consider that the decision
maker is accustomed to the environments, and we assume that the current state
$\tilde{s}_t$ is cognizable with probability from information of the initial states
$\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^n$ and the past states
$\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{t-1}$. Namely, from the viewpoint of (18), we deal
with only policies $\pi$ such that fuzzy random variables
$X_t^{\pi} : \Omega \to \mathcal{F}(C)$ taking values in states are given by ([8, 12])

\[ X_t^{\pi} = \sum_{i=1}^{n} W_t^i \tilde{c}^i + \sum_{j=1}^{t-1} W_t^{n+j} X_j^{\pi} \tag{29} \]

for some sequence of real random variables $\{W_t^i\}_{i=1}^{n+t-1}$ satisfying
$-p_t \le W_t^i \le 1 + p_t$ $(i = 1, 2, \ldots, n + t - 1)$ and
$\sum_{i=1}^{n+t-1} W_t^i = 1$, where

\[ W_0^i := \begin{cases} 1 & \text{if } X_0^{\pi} = \tilde{c}^i \\ 0 & \text{if } X_0^{\pi} \ne \tilde{c}^i \end{cases} \tag{30} \]

for $i = 1, 2, \ldots, n$.
In canonical formulation, we define
$W_t(\omega) = (W_t^1(\omega), W_t^2(\omega), \ldots, W_t^{n+t-1}(\omega)) := \omega(t) \in \Omega_t$
for $t = 0, 1, \ldots, T$ and a sample path
$\omega = \{\omega(t)\}_{t=0}^{T} \in \Omega = \prod_{t=0}^{T} \Omega_t$. Then we can define the
random variable $X_t^{\pi}$ by (29), since determining a strategy $\pi_t$ in a policy
$\pi$ corresponds to determining a weight random variable $W_t$ under a history
$h_{t-1}$. In this section, by dynamic programming, we discuss an optimality equation
in the model. We put the transition
probability from a current state s* to a next state Sj+i by = St+i) 

when a history $h_t = (\tilde{s}_0, u_0, \tilde{s}_1, u_1, \ldots, \tilde{s}_{t-1}, u_{t-1}, \tilde{s}_t)$
is given. We note that the weight $W_{t+1}$ depends on its history $h_{t+1}$, and
$\Omega_t$ is decided by a strategy $\pi_t$ based on $h_t$, since the perceived state
$\tilde{o}_{t+1}$ and the weights $W_{t+1}$ depend on the action $u_t$.



4.2 Objective Functions Induced from Preferences 

Now we introduce a scaling function for the score $r_t$ and we define objective
functions and their expected values from the results regarding the score ranking
functions in the previous section. For $t = 1, 2, \ldots, T$, we define a scaling function

\[ \varphi_t(x) := \frac{x}{2K(n, t)} + \frac{1}{2}, \tag{31} \]



where $K(n, t)$ is the function given by
$K(n, t) := (n - 1)(2p_t + 1) + (2p_t + 1) \sum_{m=1}^{t-1} (2p_m + 1)$ for $\gamma = 1$.
Then, the scaling function $\varphi_t$ is a map
$\varphi_t : [-K(n, t), K(n, t)] \to [0, 1]$. Now, an expected total value
$V_0^{\pi}(h_0)$ is given by



\[ V_0^{\pi}(h_0) := E_{\tilde{s}_0}\left[ \sum_{t=0}^{T} \varphi_t(r_t(X_t^{\pi})) \right] \tag{32} \]

for $h_0 := \tilde{s}_0$ and a policy $\pi$, where $E_{\tilde{s}_0}[\cdot]$ denotes the
expectation with respect to paths with the given initial state $\tilde{s}_0$. Then we note
that $\varphi_t(r_t(X_t^{\pi})) \in [0, 1]$ for each $t = 0, 1, \ldots, T$. With the scaling
function (31), we can balance the scores $\varphi_t(r_t(X_t^{\pi}))$
$(t = 0, 1, \ldots, T)$. Let $t (= 0, 1, 2, \ldots, T)$ be a current time.
To derive an optimality equation, we introduce total values $V_t^{\pi}(h_t)$ at time $t$ by

\[ V_t^{\pi}(h_t) := E_{h_t}\left[ \sum_{m=t}^{T} \varphi_m(r_m(X_m^{\pi})) \right], \tag{33} \]

where $E_{h_t}[\cdot]$ denotes the expectation with respect to paths with a history $h_t$.
Next, we define the optimal total values $V_t(h_t)$ at time $t$ by

\[ V_t(h_t) := \sup_{\pi} V_t^{\pi}(h_t). \tag{34} \]

Then, we obtain the following equation.

7T 

Then, we obtain the following equation. 



(The optimality equation): It holds that

\[ V_t(h_t) = \sup_{\pi} E_{h_t}\left[ \varphi_t(r_t(\tilde{s}_t)) + V_{t+1}(h_t, u_t, X_{t+1}^{\pi}) \right] \tag{35} \]

for $t = 0, 1, 2, \ldots, T - 1$, and

\[ V_T(h_T) = \varphi_T(r_T(\tilde{s}_T)) \tag{36} \]

at terminal time $T$.
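
The backward recursion behind (35) and (36) can be sketched on a toy problem. The Python code below is a deliberately simplified illustration, not the model of this paper: it assumes finitely many states and actions and Markov transitions, whereas the values above depend on the whole history $h_t$. All names and the toy data are hypothetical.

```python
def backward_induction(states, actions, horizon, reward, transition):
    """Finite-horizon dynamic programming.

    reward(t, s)        -- scaled score of state s at time t (a value in [0, 1])
    transition(t, s, u) -- dict {next_state: probability} under action u
    Returns V with V[t][s] = reward(t, s) + max_u E[V[t + 1][next state]],
    and V[horizon][s] = reward(horizon, s).
    """
    V = {horizon: {s: reward(horizon, s) for s in states}}
    for t in range(horizon - 1, -1, -1):
        V[t] = {}
        for s in states:
            best = max(sum(prob * V[t + 1][nxt]
                           for nxt, prob in transition(t, s, u).items())
                       for u in actions)
            V[t][s] = reward(t, s) + best
    return V


# Hypothetical toy model: two states, "stay" keeps the state,
# "try" switches to the other state with probability 0.5.
def toy_reward(t, s):
    return float(s)


def toy_transition(t, s, u):
    return {s: 1.0} if u == "stay" else {s: 0.5, 1 - s: 0.5}


V = backward_induction([0, 1], ["stay", "try"], 2, toy_reward, toy_transition)
```

The terminal values equal the terminal scores, mirroring (36), and each earlier value adds the current score to the best expected continuation value, mirroring (35).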

5 Conclusion 

In this paper, we have discussed the following:

— A method to extend the cognizable range by steps.

— A criterion based on aggregated fuzzy preferences in dynamic and stochastic
decision making.

— An optimality equation for this model derived by dynamic programming.

Mathematical models simulating human behavior and decision making are
applicable to problems in various fields, for example robotics, customers'
behavior analysis in marketing, multi-agent systems, and
multi-attribute/multi-objective problems in management science.
Abstract. In several situations, a set of objects must be positioned
based on the preferences of a set of individuals. Sometimes, each
individual can/does only include a limited subset of objects in his preferences
(partial preferences). We present an approach whereby a matrix of
distances between objects can be derived based on the partial preferences
expressed by individuals on those objects. In this way, the similarities
and differences between the various objects can subsequently be
analyzed. A graphical representation of objects can also be obtained from
the distance matrix using classical multivariate techniques such as
hierarchical classification and multidimensional scaling.

Keywords: Preference structures, Object representation, Multivariate
analysis, Classification.



1 Introduction 

Preference structures have been an active area of research in recent years as they
can be used to model preferences in a broad range of different applications. The
appearance of the World Wide Web with a strong need for search engines and 
interactive tools for information access [9] has further magnified the importance 
of preference structures. Those structures are now pervasive in most systems to 
optimize access to knowledge. 

The need for preference structures in real-world applications requires the 
development of new tools and methods. In this way, beyond basic research on 
preference modeling, new topics are of interest. For example: 

— aggregation of preference structures to cope with metasearch engines (sys- 
tems that search different databases); 

— methods to compute similarities and distances between preferences (e.g. to
cluster customers on the basis of their preferences);

— methods to compute similarities and distances between alternatives
considered in preferences (e.g. to cluster products on the basis of customers'
preferences).

Extensive research is documented in the literature on how to establish a sound 
basis for preference modeling. E.g. see [12] for a detailed description of results 
in this area; see [4] for a description of the field from a historical perspective; 
see also [11] for recent results in this area. Similarly, a large number of contribu- 
tions have been devoted to preference aggregation, especially for retrieval from 
multiple sources or by multiple engines. See [9], [14] or [13] for details on such 
systems and on operators for aggregation of preferences. For a more technical 
paper on aggregation see e.g. [7]. 

One of the applications of modeling and aggregating preferences is to com- 
pute similarities and distances between objects from the preferences expressed 
on them. The availability of distances between objects allows positioning those 
objects and assessing their relationships. Computation of distances over prefer- 
ences has been studied at length (see e.g. [10]). 

1.1 Our Contribution 

In several situations, a set of objects must be positioned based on the prefer- 
ences of a set of individuals. Sometimes, each individual can/does only include 
a limited subset of objects in his preferences (partial preferences). 

Example 1. Some real-life examples where partial preferences appear are the 
following: 

— A set of individuals must choose their 15 favorite or most visited web pages 
among a large set of web pages or even all web pages. 

— In order to analyze the consumer perception and preferences about several 
wine brands, a sample of consumers are asked to rank their favorite 5 wines. 

— Students wishing to access higher education in a certain state are asked 
to rank their 8 favorite degrees among the degrees offered by the various 
universities in the state. 

We present an approach whereby a matrix of distances between objects can be 
generated from the partial preferences expressed by individuals on those objects. 
In this way, the similarities and differences between the various objects can 
subsequently be analyzed. A graphical representation of objects can also be 
obtained from the distance matrix using classical multivariate techniques such 
as hierarchical classification and multidimensional scaling. 

Section 2 contains basic concepts and notation used in the rest of the paper. 
The construction of the distance matrix between objects is specified in Section 3. 
A practical application is described in Section 4. Section 5 contains some con- 
clusions. 
2 Basic Concepts

The goal of this work is the representation of a set of objects from the prefer- 
ences expressed by a set of individuals. The representation should be such that 
the similarities and the differences between objects become evident, i.e. that 
a relative positioning of objects arises. 

Let us assume that we have a set of K objects and n individuals. Each
individual chooses a subset of k objects among the K available objects and
ranks the k chosen objects according to his preferences. This yields an $n \times k$
preference matrix $X = \{x_{ij}\}$, for $1 \le i \le n$ and $1 \le j \le k$, where $x_{ij}$ represents
the object ranked by the i-th individual in the j-th position in his order of
preference.

Depending on the value of k, we have two types of preference matrices: 

— Total preference matrix. If k = K, each individual expresses his preferences
over the whole set of objects. Thus, every object appears in the preferences
of every individual at some position.

— Partial preference matrix. If k < K, each individual only chooses a subset
of k objects and ranks them according to his preferences. A specific object
may not appear in the preferences of a specific individual.

We will concentrate here on partial preference matrices, so we assume k < K. 

3 Construction of the Distance Matrix Between Objects 

In order to represent the K objects, two classical techniques of multivariate 
analysis will be used [8]: hierarchical classification and multidimensional scaling. 
Those techniques require a matrix of distances between objects. We describe 
in this section the construction of a distance matrix from a matrix of partial 
preferences. 

We start from the n x k preference matrix X = {xij}. We go through the 
following steps to compute the distance between two objects r and s. 

1 . Compute a similarity measure between r and s which takes into account the 
difference between the positions of those objects in the order of preferences of 
the individuals. The closer the positions of objects r and s in the individuals’ 
preferences, the more similar are the objects; conversely, the farther their 
positions, the less similar are objects. We present two possible approaches 
to implementing this similarity measure: 

(a) Uniform distribution based on the distance between positions. The idea is
that the similarity between objects r and s contributed by an individual
decreases linearly with the distance between the positions that the
individual assigns to those two objects; if one or both objects
were not chosen by that individual, then the similarity contribution for
that individual is 0. Thus, the similarity $sm^i_{rs}$ contributed by the i-th
individual takes values in the interval [a, b], where 0 < a < b < 1, if
both r and s were chosen by the i-th individual; it is 0 if r or s was
not chosen by the i-th individual. One possible expression for $sm^i_{rs}$ is as
follows (it requires k > 2):

\[
sm^i_{rs} =
\begin{cases}
\dfrac{b(k-1) - a - (b-a)\,|j_r - j_s|}{k-2} & \text{if } r, s \in \{x_{i1}, \ldots, x_{ik}\} \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

where $j_r$ and $j_s$ are the positions of objects r and s among the preferences
of the i-th individual. Using Expression (1), if objects r and s are neighbors
in the ranking (i.e., $|j_r - j_s| = 1$), one obtains $sm^i_{rs} = b$ (high
similarity); if objects r and s are at maximal distance (i.e., $|j_r - j_s| = k - 1$),
one obtains $sm^i_{rs} = a$ (low similarity).

(b) Exponential distribution based on the distance between positions. The
idea is that the similarity between objects r and s contributed by an
individual decays exponentially with the distance between the positions
that the individual assigns to those two objects. Thus, the
similarity $sm^i_{rs}$ contributed by the i-th individual can be expressed as

\[
sm^i_{rs} =
\begin{cases}
\exp(-\alpha\,|j_r - j_s|) & \text{if } r, s \in \{x_{i1}, \ldots, x_{ik}\} \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]

where the parameter $\alpha$ is chosen to determine a specific exponential
scale.

Regardless of whether uniform or exponential similarity is used, the overall
similarity $sm_{rs}$ between objects r and s is computed as
$sm_{rs} = \sum_{i=1}^{n} sm^i_{rs}$.

2. Count the number $q_{rs}$ of individuals who have chosen both r and s among
their k preferences, regardless of their positions. Formally speaking,

\[
q^i_{rs} =
\begin{cases}
1 & \text{if } r, s \in \{x_{i1}, \ldots, x_{ik}\} \\
0 & \text{otherwise}
\end{cases}
\]

and $q_{rs} = \sum_{i=1}^{n} q^i_{rs}$. (Note that this computation does not make sense with
a total preference matrix because one would have $q_{rs} = n$ for all r, s.)

3. Scale the similarity between objects r and s into the interval [0, 1] as

\[
s_{rs} = \left( \frac{sm_{rs}}{q_{rs}} \right)^{\bar{Q}/q_{rs}}
\tag{3}
\]

The rationale of Expression (3) is explained next:

— The ratio $sm_{rs}/q_{rs}$ yields a similarity value scaled between 0 and 1.

— One would like to avoid high values of $s_{rs}$ based on the choices of a very
small number $q_{rs}$ of individuals. That is the reason for the exponent in
Expression (3): since $0 < sm_{rs}/q_{rs} < 1$, an exponent with a small $q_{rs}$
reduces the value of $s_{rs}$, whereas an exponent with a large $q_{rs}$ has an
amplifying effect. The constant $\bar{Q}$ is the average of the nonzero values in
the matrix $Q = \{q_{ij}\}$, for $i, j = 1, \ldots, K$, and is used to scale $q_{rs}$ in the
exponent of Expression (3).
Table 1. Code, name, city and field of a subset of 23 university degrees offered in
Catalonia

Code  Name of degree                          City       Field
1     Biology                                 Barcelona  Science
2     Business administration                 Barcelona  Social Sciences
3     Law                                     Barcelona  Social Sciences
4     Economy                                 Barcelona  Social Sciences
5     Humanities                              Barcelona  Humanities
6     German translation and interpretation   Barcelona  Humanities
7     English translation and interpretation  Barcelona  Humanities
8     French translation and interpretation   Barcelona  Humanities
9     Audiovisual communication               Barcelona  Social Sciences
10    International trade                     Barcelona  Social Sciences
11    Design                                  Barcelona  Engineering
12    Political and administration sciences   Barcelona  Social Sciences
13    Tourism                                 Mataro     Social Sciences
14    Industrial design                       Barcelona  Engineering
15    Computer science                        Barcelona  Engineering
16    Computer systems                        Barcelona  Engineering
17    Business science                        Barcelona  Social Sciences
18    Labor relations                         Barcelona  Social Sciences
19    Business science                        Mataro     Social Sciences
20    Architecture                            Barcelona  Engineering
21    Business science (night)                Mataro     Social Sciences
22    Telematics                              Barcelona  Engineering
23    Business science / Labor relations      Barcelona  Social Sciences

4. The distance matrix $D = \{d_{ij}\}$, for $i, j = 1, \ldots, K$, can be derived from the
similarity matrix $S = \{s_{ij}\}$ computed in the previous step. There are several
options for deriving distances from similarities [8]. For any two objects r
and s, we list three possible derivations:

\[
d_{rs} = 1 - s_{rs}
\tag{4}
\]


drs — ^rs 


(5) 


drs — \/ 1 ^‘rs 


(6) 
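Steps 1 to 4 can be sketched in a few lines of code. The following is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the coding of objects as integers 0 to K - 1, and the convention that two objects never chosen together get similarity 0 (distance 1) are all hypothetical choices made here. It uses the uniform similarity of Expression (1), the scaling of Expression (3) and the derivation of Expression (4).

```python
from itertools import combinations


def uniform_similarity(gap, k, a=0.2, b=0.8):
    """Expression (1): equals b for neighbors (gap 1) and a for gap k - 1."""
    return (b * (k - 1) - a - (b - a) * gap) / (k - 2)


def distance_matrix(prefs, num_objects, a=0.2, b=0.8):
    """Build D from an n x k matrix of partial preferences (requires k > 2).

    prefs[i][j] is the object (coded 0 .. num_objects - 1) that individual i
    ranks in position j; each ranking is assumed to list distinct objects.
    """
    k = len(prefs[0])
    sm = [[0.0] * num_objects for _ in range(num_objects)]  # summed similarities
    q = [[0] * num_objects for _ in range(num_objects)]     # co-occurrence counts
    for ranking in prefs:
        for (jr, r), (js, s) in combinations(enumerate(ranking), 2):
            sim = uniform_similarity(abs(jr - js), k, a, b)
            sm[r][s] += sim
            sm[s][r] += sim
            q[r][s] += 1
            q[s][r] += 1
    nonzero = [count for row in q for count in row if count > 0]
    q_bar = sum(nonzero) / len(nonzero)  # average of the nonzero counts
    # Assumption: pairs never chosen together get similarity 0 (distance 1).
    dist = [[0.0 if r == s else 1.0 for s in range(num_objects)]
            for r in range(num_objects)]
    for r in range(num_objects):
        for s in range(num_objects):
            if r != s and q[r][s] > 0:
                scaled = (sm[r][s] / q[r][s]) ** (q_bar / q[r][s])  # Expression (3)
                dist[r][s] = 1.0 - scaled                           # Expression (4)
    return dist
```

For instance, if two individuals both submit the ranking (0, 1, 2) out of K = 4 objects, objects 0 and 1 are neighbors in both rankings and end up close to each other, objects 0 and 2 end up farther apart, and object 3, which nobody chose, stays at distance 1 from all the others.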



From the distance matrix D between objects, two classical techniques in 
multivariate analysis can be used: 

1. Hierarchical classification ([5, 6]). We choose the option of average linkage
between groups to form new groups and obtain a dendrogram where the 
various objects are progressively clustered. The dendrogram is shaped like 
an inverted tree where leaves represent objects. The objects clustered at the 
lowest levels of the dendrogram (closest to leaves) are those with the highest 
similarity. 
Fig. 1. Dendrogram of 23 Catalan university degrees



2. Multidimensional scaling ([1]). The interpretation of the relationships be- 
tween objects is not obvious at all from direct examination of the distance 
matrix. The goal of multidimensional scaling is to plot in a graphic (typi- 
cally in two dimensions) the structure of the distance matrix (or the simi- 
larity matrix). In this way, we obtain a simpler and clearer visualization of 
the connections between objects. In our empirical work, we have used the 
PROXSCAL algorithm [2, 3]. 
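The average-linkage grouping used for the dendrogram can be illustrated with a naive sketch. This is not the SPSS procedure used in the paper; `average_linkage` is a hypothetical helper written here for illustration, and it simply repeats the textbook rule of merging the two groups with the smallest average pairwise distance.

```python
def average_linkage(dist):
    """Naive average-linkage agglomeration on a symmetric distance matrix.

    Returns the merge history as (members_a, members_b, height) tuples,
    which is the information a dendrogram displays.
    """
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                ca, cb = clusters[ids[x]], clusters[ids[y]]
                # Average of all pairwise distances between the two groups.
                avg = sum(dist[i][j] for i in ca for j in cb) / (len(ca) * len(cb))
                if best is None or avg < best[0]:
                    best = (avg, ids[x], ids[y])
        height, keep, drop = best
        merges.append((sorted(clusters[keep]), sorted(clusters[drop]), height))
        clusters[keep] = clusters[keep] + clusters.pop(drop)
    return merges
```

The merge heights grow as less similar groups are joined, so the objects merged first correspond to the leaves clustered at the lowest levels of the dendrogram.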

4 Application: Positioning University Degrees 

Students wishing to enter the public university system of Catalonia are required
to list their preferred degrees. Specifically, they can express up to k = 8
preferences, where each preference specifies a degree in a certain university; the
student ranks his 8 preferences from most preferred (1) to least preferred (8). In
the academic year 2003-2004, the 8 Catalan public universities offered K = 378
degrees. Also in that year, n = 42125 students entering the public university
system expressed their (partial) preferences. Based on the full set of students'
partial preferences, we have constructed the distance matrix D for the 378
degrees being offered.

Fig. 2. Scatterplot of 23 Catalan university degrees

In order to compute D, uniform similarities have been used with an interval
(a, b) = (0.2, 0.8). The distance matrix has been derived from the similarity
matrix by using the transformation described by Expression (4).

The two aforementioned multivariate techniques (hierarchical classification 
and multidimensional scaling) have been applied to the distance matrix using 
version 11 of the SPSS statistical package. 

For clarity and owing to space limitations, we next present graphics corresponding
to a subset of 23 degrees among the 378 total degrees. The degrees in this subset
are offered by Universitat Pompeu Fabra of Barcelona and are listed in
Table 1. Figure 1 depicts the dendrogram obtained by hierarchical classification.
Figure 2 depicts the two-dimensional scatterplot obtained by multidimensional
scaling. Clearly, leaves clustered early in Figure 1 and/or close points in
Figure 2 represent degrees perceived as similar. From the representations in the two
figures, interesting inferences can be made: for example, two different degrees
located in the same city are perceived as being closer than two identical university
degrees located in distant cities. Such is the case for degrees 17 and 19, which
are the same (business science) but located in two different cities; however, 17
is closer to different degrees in the same city, like 2 and 4. This might be seen
as an indication of the low mobility of students.
5 Conclusion

Starting from a matrix of partial preferences expressed by a set of individuals on 
a set of objects, a matrix of distances between the objects has been constructed. 
The distance matrix already gives an idea about the relative positioning of the 
various objects from the point of view of the individuals. Thus, with this distance 
matrix, “close” and “far away” objects can readily be identified. 

If multivariate techniques like hierarchical classification and multidimensional 
scaling are used on the distance matrix, a better visualization of the relative po- 
sitioning of objects is obtained. Those techniques yield dendrograms and scat- 
terplots which are in fact “maps” representing how the set of objects is perceived 
by the individuals. 
Abstract. This paper presents a comparative study of methods for clustering
long-term temporal data. We split a clustering procedure into two
processes: similarity computation and grouping. As similarity computation
methods, we employed dynamic time warping (DTW) and multiscale
matching. As grouping methods, we employed conventional agglomerative
hierarchical clustering (AHC) and rough sets-based clustering (RC).
Using various combinations of these methods, we performed clustering
experiments on the hepatitis data set and evaluated the validity of the results.
The results suggested that (1) the complete-linkage (CL) criterion outperformed
the average-linkage (AL) criterion in terms of the interpretability
of the dendrogram and clustering results, (2) the combination of DTW and
CL-AHC consistently produced interpretable results, (3) the combination of
DTW and RC can be used to find the core sequences of the clusters, and
(4) multiscale matching may suffer from the treatment of 'no-match'
pairs; however, the problem may be avoided by using RC as a subsequent
grouping method.



1 Introduction 

Clustering of time-series data [1] has been receiving considerable interest as
a promising method for discovering interesting features shared by
a set of sequences. One of the most important issues in time-series clustering is
the determination of (dis-)similarity between the sequences. Basically, the similarity
of two sequences is calculated by accumulating distances between data points that
are located at the same time position, because such a distance-based similarity
has preferable mathematical properties that extend the choice of grouping
algorithms. However, this method requires that the lengths of all sequences
be the same. Additionally, it cannot compare structural similarity of
the sequences; for example, if two sequences contain the same number of peaks,
but at slightly different phases, their 'difference' is emphasized rather than their
structural similarity [2].

* This work was supported in part by the Grant-in-Aid for Scientific Research on
Priority Area (B) (No. 759) "Implementation of Active Mining in the Era of Information
Flood" by the Ministry of Education, Culture, Science and Technology of Japan.

These drawbacks are serious in the analysis of time-series data collected over
a long time. Long time-series data have the following features. First, the lengths
and sampling intervals of the data are not uniform. The starting point of data
acquisition may be several years or even a few decades ago. Some arrangement of
the data is therefore needed; however, shortening a time series may cause the
loss of precious information. Second, a long time series contains both long-term
and short-term events, and their lengths and phases are not the same. Additionally,
the sampling interval of the data may vary owing to changes in
acquisition strategy over a long time.

Some methods are considered applicable to clustering long time series.
For example, dynamic time warping (DTW) [3] can be used to compare two
sequences of different lengths, since it seeks the closest pairs of points while allowing
one-to-many point matching. This feature also enables us to capture similar events
that have time shifts. Another approach, multiscale structure matching [6] [5],
can also be used for this task, since it compares two sequences according
to the similarity of partial segments derived from the inflection points of
the original sequences. However, few studies have empirically evaluated the
usefulness of these methods on real-world long time-series data sets.
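To make the one-to-many point matching of DTW concrete, here is a minimal sketch of the textbook dynamic-programming recurrence. It is an illustration only: the function name and the use of the absolute difference as local cost are assumptions made here, and it is not necessarily the symmetrical variant employed in the experiments.

```python
def dtw_distance(x, y):
    """Textbook DTW between two numeric sequences of possibly different lengths.

    cost[i][j] holds the smallest accumulated cost of aligning x[:i] with y[:j];
    each point may be matched to several points of the other sequence.
    """
    inf = float("inf")
    n, m = len(x), len(y)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost
            cost[i][j] = local + min(cost[i - 1][j],      # x[i-1] matched again
                                     cost[i][j - 1],      # y[j-1] matched again
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

With this sketch, the sequences (1, 2, 3) and (1, 2, 2, 3) have distance 0 even though their lengths differ, because the repeated value is absorbed by a one-to-many match.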

This paper reports the results of an empirical comparison of similarity measures
and grouping methods on the hepatitis data set [7]. The hepatitis dataset is
a unique long time-series medical dataset with the following features:
irregular sequence length, irregular sampling interval and co-existence of clinically
interesting events of various lengths (for example acute events and
chronic events). We split a clustering procedure into two processes: similarity
computation and grouping. For similarity computation, we employed DTW and
multiscale matching. For grouping, we employed conventional agglomerative
hierarchical clustering [8] and rough sets-based clustering [9], because these
methods can be used as unsupervised methods and are suitable for handling
the relative similarity induced by multiscale matching. For every combination of the
similarity computation methods and grouping methods, we performed clustering
experiments and evaluated the validity of the results.



2 Materials 

We employed the chronic hepatitis dataset [7], which was provided as a common
dataset for the ECML/PKDD Discovery Challenge 2002 and 2003. The dataset
contained long time-series data on laboratory examinations, which were collected at
Chiba University Hospital in Japan. The subjects were 771 patients with hepatitis
B and C who took examinations between 1982 and 2001. We manually removed
sequences for 268 patients because biopsy information was not provided for them
and thus their virus types were not clearly specified. According to the biopsy
information, the expected composition of the remaining 503 patients was
B / C-noIFN / C-IFN = 206 / 100 / 197. However, owing to missing
examinations, the number of available sequences could be less than 503.

The dataset contained a total of 983 laboratory examinations. However, in
order to simplify our experiments, we selected 13 items from blood tests relevant
to the liver function: ALB, ALP, G-GL, G-GTP, GOT, GPT, HGB, LDH, PLT,
RBC, T-BIL, T-CHO and TTT. Details of each examination are available at the
URL [7].

Each sequence originally had different sampling intervals, from one day to
one year. From preliminary analysis we found that the most frequent
interval was one week; this means that most of the patients took examinations on
a fixed day of the week. Accordingly, we set the re-sampling
interval to seven days. A simple summary of the number of data points
after re-sampling is as follows (item = ALB, n = 499): mean = 456.87, sd = 300,
maximum = 1080, minimum = 7. Note that one point corresponds to one week; therefore,
456.87 points correspond to 456.87 weeks, namely about 8.8 years.



3 Methods 

We have implemented the symmetrical time warping algorithm described in [2]
and the one-dimensional multiscale matching described in [4]. We also implemented
two clustering algorithms: conventional agglomerative hierarchical clustering
(AHC) [8] and rough sets-based clustering (RC) [9]. For AHC we employed
two linkage criteria, average-linkage AHC (AL-AHC) and complete-linkage AHC
(CL-AHC). The methodologies of multiscale matching and rough clustering are
briefly described in Sections 3.1 and 3.2 for the reader's convenience.

In the experiments, we investigated the usefulness of various combinations
of similarity calculation methods and grouping methods in terms of the
interpretability of the clustering results. The data were prepared as follows.
First, we selected one examination, for example ALB, and split the
corresponding sequences into three subsets, B, C-noIFN and C-IFN, according to
the virus type and administration of interferon therapy. Next, for each of the
three subsets, we computed the dissimilarity of each pair of sequences by using
DTW. After repeating the same process with multiscale matching, we obtained
2 x 3 sets of dissimilarities: three obtained by DTW and three obtained by
multiscale matching. Then we applied the grouping methods AL-AHC, CL-AHC
and RC to each of the three dissimilarity sets obtained by DTW. This yielded
3 x 3 = 9 sets of clusters. After applying the same process to the sets obtained
by multiscale matching, we obtained a total of 18 sets of clusters. The above
process was repeated with the remaining 12 examination items. Consequently, we
constructed 13 x 18 clustering results. Note that in these experiments we did
not perform cross-examination comparison, for example comparison of an ALB
sequence with a GPT sequence.
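The experimental design described above is a full factorial over four factors, which can be enumerated directly. The snippet below is only a bookkeeping sketch; the variable names are illustrative and no clustering is performed.

```python
from itertools import product

# The 13 examination items and the three patient subsets named in the text.
EXAMS = ["ALB", "ALP", "G-GL", "G-GTP", "GOT", "GPT", "HGB",
         "LDH", "PLT", "RBC", "T-BIL", "T-CHO", "TTT"]
SUBSETS = ["B", "C-noIFN", "C-IFN"]
SIMILARITIES = ["DTW", "multiscale matching"]
GROUPINGS = ["AL-AHC", "CL-AHC", "RC"]

# One clustering result per (exam, similarity, subset, grouping) combination.
runs = list(product(EXAMS, SIMILARITIES, SUBSETS, GROUPINGS))
per_exam = len(SIMILARITIES) * len(SUBSETS) * len(GROUPINGS)
```

Here `per_exam` is 18 and `len(runs)` is 13 x 18 = 234, matching the counts given in the text.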

We used the parameter Th = 0.3 for rough clustering. In AHC,
cluster linkage was terminated when the increase of dissimilarity first exceeded
the mean + SD of the set of all increase values.
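The termination rule for AHC can be sketched as follows. This is an illustrative reading of the rule rather than the authors' code: `merges_before_stop` is a hypothetical helper, the merge heights are assumed to be listed in merge order, and the population standard deviation is assumed because the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def merges_before_stop(heights):
    """Number of merges kept under the 'mean + SD of increases' rule.

    heights[i] is the dissimilarity at which the i-th merge happens.
    Linkage stops before the first merge whose increase over the previous
    one exceeds mean + SD of all increases (population SD is an assumption).
    """
    increases = [b - a for a, b in zip(heights, heights[1:])]
    threshold = mean(increases) + pstdev(increases)
    for idx, inc in enumerate(increases):
        if inc > threshold:
            return idx + 1  # merges 0 .. idx are performed
    return len(heights)     # no large jump: all merges are performed
```

With n sequences, keeping m merges leaves n - m clusters, so a large jump early in the merge history yields many clusters and a late jump yields few.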
3.1 Multiscale Matching for Time-Series Data

Multiscale matching, proposed by Mokhtarian [6], was originally developed as
a method for comparing two planar curves by partly changing observation scales.
It divides the contour of an object into partial contours based on the locations of
inflection points. After generating partial contours at various scales for each of
the two curves to be compared, it finds the best pairs of partial contours that
minimize the total dissimilarity while preserving completeness of the
concatenated contours. This method can preserve connectivity of partial contours by
tracing the hierarchical structure of inflection points in scale space. Since each
end of a partial contour corresponds exactly to an inflection point and the
correspondence between inflection points at different scales is recognized, the
connectivity of the partial contours is guaranteed.

We have extended this method so that it can be applied to the comparison 
of two one-dimensional temporal sequences. A planar curve can be redefined 
as a temporal sequence, and a partial contour can be analogously redefined as 
a subsequence. Now let us introduce the basics of multiscale matching for one- 
dimensional temporal sequence. First, we represent time-series A using multiscale 
description. 

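Let x(t) represent an original temporal sequence of A, where t denotes the
time of data acquisition. The sequence at scale $\sigma$, $X(t, \sigma)$, can be represented as
a convolution of x(t) and a Gaussian function with scale factor $\sigma$, $g(t, \sigma)$, as follows:

\[
X(t, \sigma) = x(t) \otimes g(t, \sigma)
= \int_{-\infty}^{+\infty} x(u)\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-u)^2 / (2\sigma^2)}\, du.
\tag{1}
\]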



Figure 1 shows an example of sequences at various scales. From Figure 1 and the
function above, it is clear that the sequence becomes smoother at higher scales
and that the number of inflection points also decreases at higher scales. The
curvature of the sequence can be calculated as

\[
K(t, \sigma) = \frac{X''}{(1 + X'^2)^{3/2}},
\tag{2}
\]



where $X'$ and $X''$ denote the first- and second-order derivatives of $X(t, \sigma)$,
respectively. The m-th derivative of $X(t, \sigma)$, $X^{(m)}(t, \sigma)$, is derived as a convolution
of x(t) and the m-th order derivative of $g(t, \sigma)$, $g^{(m)}(t, \sigma)$, as

\[
X^{(m)}(t, \sigma) = x(t) \otimes g^{(m)}(t, \sigma).
\tag{3}
\]

The next step is to find inflection points according to changes in the sign of
the curvature and to construct segments. A segment is a subsequence whose ends
correspond to adjacent inflection points. Let $A^{(k)}$ be a set of N
segments that represents the sequence at scale $\sigma^{(k)}$. $A^{(k)}$ can be represented as

\[
A^{(k)} = \{ a_i^{(k)} \mid i = 1, 2, \ldots, N \}.
\tag{4}
\]

Fig. 1. Multiscale matching

In the same way, for another temporal sequence B, we can obtain a set of
segments $B^{(h)}$ at scale $\sigma^{(h)}$ as

\[
B^{(h)} = \{ b_j^{(h)} \mid j = 1, 2, \ldots, M \},
\tag{5}
\]

where M denotes the number of segments of B at scale $\sigma^{(h)}$.
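The multiscale description can be illustrated numerically. The sketch below is a simplified discrete stand-in for the definitions above, under assumptions made here: the Gaussian kernel is truncated at three standard deviations, the sequence ends are clamped, and inflection points are located through sign changes of the second difference (the sign of the curvature equals the sign of the second derivative because the denominator of the curvature is positive). The function names are hypothetical.

```python
import math


def gaussian_smooth(x, sigma):
    """Discrete convolution of x with a normalized, truncated Gaussian kernel."""
    radius = max(1, int(math.ceil(3 * sigma)))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(u * u) / (2.0 * sigma * sigma)) for u in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(x)
    smoothed = []
    for t in range(n):
        acc = 0.0
        for u, w in zip(offsets, kernel):
            idx = min(max(t + u, 0), n - 1)  # clamp at the ends (assumption)
            acc += w * x[idx]
        smoothed.append(acc)
    return smoothed


def inflection_points(x):
    """Indices where the second difference of x changes sign."""
    second = [x[t - 1] - 2 * x[t] + x[t + 1] for t in range(1, len(x) - 1)]
    # second[t] belongs to x[t + 1]; report the index where the new sign starts.
    return [t + 1 for t in range(1, len(second)) if second[t - 1] * second[t] < 0]


def segments(x):
    """Split the index range of x at consecutive inflection points."""
    cuts = [0] + inflection_points(x) + [len(x) - 1]
    return list(zip(cuts, cuts[1:]))
```

Applying `segments(gaussian_smooth(x, sigma))` for increasing values of `sigma` gives progressively fewer and longer segments, which is the hierarchy of segments that the matching procedure works on.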

The main procedure of multiscale structure matching is to find the best set
of segment pairs that minimizes the total difference. Figure 1 illustrates the
process. For example, five contiguous segments at the lowest scale of Sequence A
are integrated into one segment at the highest scale, and the integrated segment
matches well one segment in Sequence B at the lowest scale. Thus the set of
the five segments in Sequence A and the one segment in Sequence B will be
considered as a candidate for corresponding subsequences. Meanwhile, another pair of
segments will be matched at the lowest scale. In this way, matching is performed
throughout all scales. The resultant set of segment pairs must be neither redundant
nor insufficient to represent the original sequences. Namely, by concatenating all
subsequences in the set, the original sequence must be completely reconstructed
without any missing intervals or overlaps. The matching process can be accelerated by
implementing a dynamic programming scheme [5].

The total difference between sequences A and B is defined as the sum of the
dissimilarities of all matched segment pairs:

\[
D(A, B) = \sum_{p=1}^{P} d\bigl(a_p^{(0)}, b_p^{(0)}\bigr),
\tag{6}
\]

where P denotes the number of matched segment pairs. The notation
$d(a_i^{(k)}, b_j^{(h)})$ denotes the dissimilarity of the segment pair $a_i^{(k)}$ and $b_j^{(h)}$ at
scales k and h, defined below:

\[
d\bigl(a_i^{(k)}, b_j^{(h)}\bigr) = \max(\theta, l, \phi, g),
\tag{7}
\]

where $\theta$, $l$, $\phi$ and $g$ respectively represent the differences in rotation angle,
length, phase and gradient of segments $a_i^{(k)}$ and $b_j^{(h)}$ at scales k and h. These
differences are defined as follows:

\[
\theta\bigl(a_i^{(k)}, b_j^{(h)}\bigr) = \frac{\bigl|\theta_{a_i}^{(k)} - \theta_{b_j}^{(h)}\bigr|}{2\pi},
\tag{8}
\]
Fig. 2. Segment difference




( 11 ) 



( 10 ) 



( 9 ) 



Figure 2 provides an illustrative explanation of these terms.

Multiscale matching usually suffers from the shrinkage of curves at high scales
caused by excessive smoothing with a Gaussian kernel. On one-dimensional
time-series data, shrinkage makes all sequences flat at high scales. In order to avoid
this problem, we applied the shrinkage correction proposed by Lowe [10].

3.2 Rough Clustering 

Rough clustering is a clustering method based on the indiscernibility degree of
objects. The main benefit of this method is that it can be applied to proximity
measures that do not satisfy the triangle inequality. Additionally, it can be
used with a proximity matrix; thus it does not require direct access to the
original data values.

Let us first introduce some fundamental definitions of rough sets related to
our work. Let $U \neq \emptyset$ be a universe of discourse and X be a subset of U. An
equivalence relation R classifies U into a set of subsets $U/R = \{X_1, X_2, \ldots, X_n\}$
in which the following conditions are satisfied:

(1) $X_i \subseteq U$, $X_i \neq \emptyset$ for any i,

(2) $X_i \cap X_j = \emptyset$ for any $i \neq j$,

(3) $\bigcup_{i=1,2,\ldots,n} X_i = U$.

Any subset $X_i$, called a category, represents an equivalence class of R. A category
in R containing an object $x \in U$ is denoted by $[x]_R$. An indiscernibility relation
IND(R) is defined as follows:

\[
x_i \, IND(R) \, x_j = \{ (x_i, x_j) \in U^2 \mid x_i, x_j \in P,\ P \in U/R \}.
\]
[Figure panels: Step 1, Step 2]

Fig. 3. Rough clustering



For a family of equivalence relations $\mathbf{P} \subseteq \mathbf{R}$, an indiscernibility relation over $\mathbf{P}$
is denoted by $IND(\mathbf{P})$ and defined as follows:

\[
IND(\mathbf{P}) = \bigcap_{R \in \mathbf{P}} IND(R).
\]

The clustering method consists of two steps: (1) assignment of initial
equivalence relations and (2) iterative refinement of the initial equivalence relations.
Figure 3 illustrates each step. In the first step, we assign an initial equivalence
relation to every object. An initial equivalence relation classifies the objects into
two sets: one is a set of objects similar to the corresponding object and the
other is a set of dissimilar objects. Let $U = \{x_1, x_2, \ldots, x_n\}$ be the entire set of n
objects. An initial equivalence relation $R_i$ for object $x_i$ is defined as

\[
R_i = \{\{P_i\}, \{U - P_i\}\}, \qquad
P_i = \{ x_j \mid s(x_i, x_j) > S_i \}, \quad \forall x_j \in U,
\]

where $P_i$ denotes a set of objects similar to $x_i$, that is, a set of objects whose
similarity to $x_i$ is larger than a threshold value. Similarity $s(x_i, x_j)$ is an output
of multiscale structure matching, where $x_i$ and $x_j$ correspond to A and B in the
previous subsection respectively. The threshold value $S_i$ is determined automatically
at the point where s decreases sharply. A set of indiscernible objects obtained
using all sets of equivalence relations forms a cluster. In other words, a cluster
corresponds to a category $X_i$ of $U/IND(\mathbf{R})$.

In the second step, we refine the initial equivalence relations according to
their global relationships. First, we define an indiscernibility degree, $\gamma$, which
represents how many equivalence relations commonly regard two objects as
indiscernible, as follows:

\[
\gamma(x_i, x_j) = \frac{1}{|U|} \sum_{k=1}^{|U|} \delta_k(x_i, x_j),
\]

\[
\delta_k(x_i, x_j) =
\begin{cases}
1, & \text{if } [x_k]_{R_k} \cap \bigl([x_i]_{R_k} \cap [x_j]_{R_k}\bigr) \neq \emptyset \\
0, & \text{otherwise.}
\end{cases}
\]

Objects with a high indiscernibility degree can be interpreted as similar objects.
Therefore, they should be classified into the same cluster. Thus we modify an
equivalence relation if it is able to discern objects with high $\gamma$, as follows:

\[
R'_i = \{\{P'_i\}, \{U - P'_i\}\},
\]
Table 1. Comparison of the number of generated clusters. Each item represents
clusters for Hepatitis B / C-noIFN / C-IFN cases

                          Number of Generated Clusters
        Number of         DTW                                          Multiscale Matching
Exam    Instances         AL-AHC         CL-AHC         RC             AL-AHC         CL-AHC         RC
ALB     204 / 99 / 196    8 / 3 / 3      10 / 6 / 5     38 / 22 / 32   19 / 11 / 12   22 / 21 / 27   6 / 14 / 31
ALP     204 / 99 / 196    6 / 4 / 6      7 / 7 / 10     21 / 12 / 29   10 / 18 / 14   32 / 16 / 14   36 / 12 / 46
G-GL    204 / 97 / 195    2 / 2 / 5      2 / 2 / 1      1 / 1 / 21     15 / 16 / 194  16 / 24 / 194  24 / 3 / 49
G-GTP   204 / 99 / 196    2 / 4 / 11     2 / 6 / 7      1 / 17 / 4     38 / 14 / 194  65 / 14 / 19   35 / 8 / 51
GOT     204 / 99 / 196    S / 10 / 25    8 / 4 / 7      50 / 18 / 60   19 / 12 / 24   35 / 19 / 19   13 / 14 / 15
GPT     204 / 99 / 196    3 / 17 / 7     7 / 4 / 7      55 / 29 / 51   23 / 30 / 8    24 / 16 / 16   11 / 7 / 25
HGB     204 / 99 / 196    3 / 4 / 13     2 / 3 / 9      1 / 16 / 37    43 / 15 / 15   55 / 19 / 22   1 / 12 / 78
LDH     204 / 99 / 196    7 / 7 / 9      15 / 10 / 8    15 / 15 / 15   20 / 25 / 195  24 / 9 / 195   32 / 16 / 18
PLT     203 / 99 / 196    2 / 13 / 9     2 / 7 / 6      1 / 15 / 19    33 / 5 / 12    34 / 15 / 17   1 / 11 / 25
RBC     204 / 99 / 196    3 / 4 / 6      3 / 4 / 7      1 / 14 / 26    32 / 16 / 13   40 / 23 / 17   1 / 6 / 17
T-BIL   204 / 99 / 196    6 / 5 / 5      9 / 5 / 4      203 / 20 / 30  17 / 25 / 6    20 / 30 / 195  11 / 23 / 48
T-CHO   204 / 99 / 196    2 / 2 / 7      5 / 2 / 5      20 / 1 / 27    12 / 13 / 13   17 / 23 / 19   12 / 5 / 23
TTT     204 / 99 / 196    7 / 2 / 5      8 / 2 / 6      25 / 1 / 32    29 / 10 / 6    39 / 16 / 16   25 / 16 / 23

\[
P'_i = \{ x_j \mid \gamma(x_i, x_j) > T_h \}, \quad \forall x_j \in U.
\]

$T_h$ is a threshold value that determines the indiscernibility of objects. This prevents
the generation of small clusters formed due to overly fine classification
knowledge. Given $T_h$, refinement of the equivalence relations is iterated until the clusters
become stable. Consequently, coarsely classified sets of sequences are obtained as
$U/IND(\mathbf{R}')$.

4 Results 

Table 1 provides the numbers of generated clusters for each combination. Let us
explain the table using the row whose first column is marked ALB. The second
column, "Number of Instances", gives the number of patients who took the
ALB examination. Its value 204/99/196 means that 204 patients with Hepatitis
B, 99 patients with Hepatitis C who did not take IFN therapy, and 196 patients
with Hepatitis C who took IFN therapy took this examination. Since each patient
has one time-series examination result, the number of patients corresponds to
the number of sequences. The third column shows the number of generated
clusters. Using DTW and AL-AHC, the 204 Hepatitis B sequences were grouped
into 8 clusters, and the 99 C-noIFN sequences and the 196 C-IFN sequences were
each grouped into 3 clusters.

4.1 DTW and AHCs 

Let us first investigate the case of DTW-AHC. Comparison of DTW-AL-AHC
and DTW-CL-AHC implies that the results can differ if we use a different
linkage criterion. The left image of Figure 4 shows a dendrogram generated from
the GPT sequences of type B hepatitis patients using DTW-AL-AHC. It can
be observed that the dendrogram of AL-AHC has an ill-formed structure
resembling 'chaining', which is usually observed with single-linkage AHC. For
such an ill-formed structure, it is difficult to find a good point at which to
terminate the merging of clusters. In this case, the method produced three
clusters containing 193, 9 and 1 sequences, respectively. The left image of
Figure 5 shows a part of the sequences grouped into the largest cluster. Almost
all types of sequences were included in this cluster, and thus no interesting
information was obtained.

Fig. 4. Dendrograms for DTW-AHC-B. Left: AHC-AL. Right: AHC-CL

Fig. 5. Examples of the clusters. Left: AHC-AL. Right: AHC-CL

In contrast, the dendrogram of CL-AHC shown on the right of Figure 4
demonstrates a well-formed hierarchy of the sequences. With this dendrogram
the method produced 7 clusters containing 27, 21, 52, 57, 43, 2, and 1 sequences.
The right image of Figure 5 shows examples of the sequences grouped into the
first cluster. One can observe interesting features for each cluster. The first
cluster contains sequences that involve continuous vibration of the GPT values.
These patterns may imply that the virus continues to attack the patient's body
periodically. The second cluster contains very short, meaningless sequences,
which may represent cases in which patients stopped or cancelled the treatment
quickly. The third cluster contains another interesting pattern: vibrations
followed by flat, low values. This may represent cases in which the patients
were cured by some treatment, or naturally.

4.2 DTW and RC 

For the same data, the rough set-based clustering method produced 55 clusters.
Fifty-five clusters are too many for 204 objects; however, 41 of the 55 clusters
contained fewer than 3 sequences, and 31 of them contained only one sequence.
This is because rough set-based clustering tends to produce independent,
small clusters for objects that lie between the large clusters. Ignoring the
small ones, we found 14 clusters containing 53, 16, 10, 9, 6, ... objects. The
largest cluster contained short sequences, quite similarly to the case of
CL-AHC. Figure 6 shows examples of sequences for the second and third clusters.
Because this method evaluates the indiscernibility degree of objects, each of
the generated clusters contains strongly similar sets of sequences. Although
the populations of the clusters are not so large, one can clearly observe
representatives of the interesting patterns described previously for CL-AHC.

Fig. 6. Examples of the clusters obtained by RC. Left: the second cluster containing
16 sequences. Right: the third cluster containing 10 sequences

4.3 Multiscale Matching and AHCs 

Comparison of the Multiscale Matching-AHC pairs with the DTW-AHC pairs shows
that the dissimilarities of Multiscale Matching resulted in a larger number
of clusters than those of DTW.

One of the important issues in multiscale matching is the treatment of 'no-match'
sequences. Theoretically, any pair of sequences can be matched, because a
sequence becomes a single segment at sufficiently high scales. However, this is
not a realistic approach, because the use of many scales results in an
unacceptable increase in computational time. If the upper bound of the scales
is too low, the method may fail to find the appropriate pairs of subsequences.
For example, suppose we have two sequences: one is a short sequence containing
only one segment and the other is a long sequence containing hundreds of
segments. The segments of the latter sequence will not be integrated into one
segment until the scale becomes considerably high. If the range of scales we
use does not cover such a high scale, the two sequences will never be matched.
In this case, the method should return infinite dissimilarity, or a special
number that identifies the failed matching.

This property prevents AHCs from working correctly. CL-AHC will never
merge two clusters if any pair of 'no-match' sequences exists between them.
AL-AHC fails to calculate the average dissimilarity between two clusters.
Figure 7 provides dendrograms for GPT sequences of Hepatitis C (with IFN)
patients obtained by using multiscale matching and AHCs. In this experiment, we
set the dissimilarity of 'no-match' pairs equal to that of the most dissimilar
'matched' pair in order to avoid computational problems. The dendrogram of
AL-AHC is compressed to the small-dissimilarity side because there are several
pairs that have excessively large dissimilarities. The dendrogram of CL-AHC
demonstrates that the 'no-match' pairs will not be merged until the end of the
merging process.

Fig. 7. Dendrograms for MSMmatch-AHC-C-IFN. Left: AHC-AL. Right: AHC-CL

Fig. 8. Examples of the sequence clusters obtained by AHCs. Left: AHC-AL, the
first cluster containing 182 sequences. Right: AHC-CL, the first cluster
containing 71 sequences
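The effect of 'no-match' pairs on the two linkage criteria can be illustrated
with a minimal sketch. It assumes a small hypothetical dissimilarity matrix in
which `math.inf` marks a failed matching; the function names and all values are
illustrative and are not taken from the experiments.

```python
import math

INF = math.inf  # assumed marker for a 'no-match' pair

def complete_linkage(d, a, b):
    """Largest dissimilarity between any member of a and any member of b."""
    return max(d[i][j] for i in a for j in b)

def average_linkage(d, a, b):
    """Mean dissimilarity over all cross-cluster pairs."""
    return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

def cap_no_match(d):
    """Replace 'no-match' entries by the largest finite ('matched') value."""
    worst = max(v for row in d for v in row if v != INF)
    return [[worst if v == INF else v for v in row] for row in d]

# Hypothetical dissimilarities between four sequences; pair (0, 3) failed.
d = [
    [0.0, 1.0, 2.0, INF],
    [1.0, 0.0, 1.5, 3.0],
    [2.0, 1.5, 0.0, 2.5],
    [INF, 3.0, 2.5, 0.0],
]
a, b = [0, 1], [2, 3]

raw_cl = complete_linkage(d, a, b)   # infinite: these clusters merge last
raw_al = average_linkage(d, a, b)    # infinite: the average is unusable

capped = cap_no_match(d)
cap_cl = complete_linkage(capped, a, b)
cap_al = average_linkage(capped, a, b)
print(raw_cl, raw_al, cap_cl, cap_al)
```

With the capped matrix both criteria return finite values, but the capped
entries still act as the largest dissimilarities in the data, which is
consistent with the compressed AL-AHC dendrogram and the late merges of CL-AHC
described above.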

For AL-AHC, the method produced 8 clusters. However, as in the previous case,
most of the sequences (182/196) were included in the same cluster.
As shown in the left image of Figure 8, no interesting information was found in
this cluster. For CL-AHC, the method produced 16 clusters containing 71, 39, 29,
... sequences. The right image of Figure 8 provides examples of the sequences
grouped into the first primary cluster.



4.4 Multiscale Matching and RC 

The rough set-based clustering method produced 25 clusters containing 80, 60,
18, 6, ... sequences. Figure 9 shows examples of the sequences grouped into the
second and third primary clusters. It can be observed that the sequences were
properly clustered into the three major patterns: continuous vibration, flat after
vibration, and short. This should result from the ability of the clustering
method to handle relative proximity.

Fig. 9. Examples of the clusters obtained by RC. Left: the second cluster containing
16 sequences. Right: the third cluster containing 10 sequences



5 Conclusions 

In this paper we have reported a comparative study of clustering methods
for long time-series data analysis. Although the subjects for comparison were
limited, the results suggested that (1) the complete-linkage criterion
outperforms the average-linkage criterion in terms of the interpretability of
the dendrogram and the clustering results, (2) the combination of DTW and CL-AHC
consistently produced interpretable results, and (3) the combination of DTW and
RC can be used to find the core sequences of the clusters. Multiscale matching
may suffer from the problem of 'no-match' pairs; however, the problem may be
avoided by using RC as the subsequent grouping method.
Abstract. As a model of information retrieval on the WWW, a fuzzy
multiset model is overviewed and a family of fuzzy document clustering
algorithms is developed. The fuzzy multiset model is enhanced in order
to adapt it to clustering applications. The standard proximity measure of the
cosine coefficient is generalized in the multiset model, and two basic
objective functions of fuzzy c-means are considered. Moreover, two methods
of handling nonlinear classification are proposed: introduction of a cluster
volume variable and a kernel trick used in support vector machines.
A crisp c-means algorithm and clustering by competitive learning are
also studied. A numerical example based on real documents is shown.



1 Introduction 

Many studies have focused on information retrieval models on the WWW.
Two features of such information should be noted: first, a retrieved data item
comes with a degree of relevance; second, the same information item can occur
several times in the set retrieved by a query. These two features are best
captured by a fuzzy multiset model [7, 5], also called a fuzzy bag [9].

In this paper we first overview the fuzzy multiset model for information
retrieval, introducing new multiset operations for the purpose of document
clustering.

We moreover show a number of new methods of document clustering in which
the fuzzy multiset space is employed. Nonlinearity in separating clusters is
emphasized, since the standard techniques of k-means, fuzzy c-means, and
clustering using competitive learning have linear cluster boundaries.

Three classes of clustering algorithms and two methods of handling
nonlinearities are considered. The three algorithms are (A) crisp c-means,
(B) fuzzy c-means, and (C) clustering using competitive learning, while the two
methods are (I) the use of a variable controlling cluster volumes, and (II) the
kernel trick employed in support vector machines [10].

The six methods obtained by combining (A), (B), (C) with (I), (II) are
formulated, and the corresponding iterative algorithms to obtain clusters are
derived.

A numerical example is given in which the construction procedure of the fuzzy
multiset space is shown. The above methods are applied to the example and the
effectiveness of the algorithms is discussed.

2 Fuzzy Multiset Model for Information Retrieval 

2.1 Fuzzy Multisets 

A fuzzy multiset $A$ of $X$ (often called a fuzzy bag) is characterized by the
function $C_A(\cdot)$ of the same symbol, but the value $C_A(x)$ is a finite
multiset in $I = [0,1]$ (see [9]). In other words, given $x \in X$,

$$C_A(x) = \{\mu, \mu', \ldots, \mu''\}, \quad \mu, \mu', \ldots, \mu'' \in I.$$

Assume $\mu, \mu', \ldots, \mu''$ are nonzero for simplicity. We write

$$A = \{\{\mu, \mu', \ldots, \mu''\}/x, \ldots\} \quad \text{or} \quad A = \{(x,\mu), (x,\mu'), \ldots, (x,\mu''), \ldots\}.$$

As a data structure, we introduce an infinite-dimensional vector:

$$C_A(x) = (\mu, \mu', \ldots, \mu'', 0, 0, \ldots).$$

The collection of such vectors is denoted by $V$:

$$V = \{ (\mu, \mu', \ldots, \mu'', 0, 0, \ldots) : \mu, \mu', \ldots, \mu'' \in I \}.$$

A sorting operation on multisets in $I$ is important in defining operations
for fuzzy multisets. This operation, denoted by $S(\cdot)$ ($S: V \to V$),
rearranges the sequence in $V$ into decreasing order:

$$S((\mu, \mu', \ldots, \mu'', 0, 0, \ldots)) = (\nu^1, \nu^2, \ldots, \nu^p, 0, 0, \ldots)$$

where $\nu^1 \geq \nu^2 \geq \cdots \geq \nu^p > 0$ and
$\{\mu, \mu', \ldots, \mu''\} = \{\nu^1, \ldots, \nu^p\}$. Thus we can assume

$$C_A(x) = (\nu^1, \nu^2, \ldots, \nu^p, 0, 0, \ldots) \tag{1}$$

The above sorted sequence for $C_A(x)$ is called the standard form for a fuzzy
multiset, as many operations are defined in terms of the standard form [7, 5].

Additional operations on $V$ are necessary in order to define fuzzy multiset
operations. Assume

$$k = (k_1, k_2, \ldots, k_r, 0, 0, \ldots), \quad l = (l_1, l_2, \ldots, l_s, 0, 0, \ldots) \in V$$

where $k_i$ ($i = 1, \ldots, r$) and $l_j$ ($j = 1, \ldots, s$) are nonzero.

Then we define

$$k \vee l = (\max\{k_1, l_1\}, \ldots, \max\{k_q, l_q\}, \ldots),$$
$$k \wedge l = (\min\{k_1, l_1\}, \ldots, \min\{k_q, l_q\}, \ldots),$$
$$k \cdot l = (k_1 \cdot l_1, \ldots, k_q \cdot l_q, \ldots),$$
$$k \mid l = (k_1, k_2, \ldots, k_r, l_1, l_2, \ldots, l_s, 0, 0, \ldots).$$

Assume $k_1 \geq k_2 \geq \cdots \geq k_r$. Then we define

$$|k| = \sum_{i=1}^{r} k_i. \tag{2}$$

Moreover we define inequality of the two vectors:

$$k \leq l \iff k_i \leq l_i, \quad i = 1, 2, \ldots \tag{3}$$

We now define fuzzy multiset operations.¹

1. inclusion:

$$A \subseteq B \iff S(C_A(x)) \leq S(C_B(x)), \quad \forall x \in X.$$

2. equality:

$$A = B \iff S(C_A(x)) = S(C_B(x)), \quad \forall x \in X.$$

3. union:

$$C_{A \cup B}(x) = S(C_A(x)) \vee S(C_B(x)).$$

4. intersection:

$$C_{A \cap B}(x) = S(C_A(x)) \wedge S(C_B(x)).$$

5. sum:

$$C_{A+B}(x) = S(S(C_A(x)) \mid S(C_B(x))).$$

6. product:

$$C_{A \cdot B}(x) = S(C_A(x)) \cdot S(C_B(x)).$$

7. cardinality:

$$|A| = \sum_{x \in X} |C_A(x)|.$$

Example. Suppose $X = \{a, b, c, d\}$ and $A = \{\{0.3, 0.5\}/a, \{0.7\}/b, \{0.9\}/c\}$,
$B = \{\{0.4, 0.4\}/a, \{0.2, 0.5\}/b\}$. We can represent $A$ as

$$A = \{(a, 0.3), (a, 0.5), (b, 0.7), (c, 0.9)\}.$$

Notice $A = \{(0.5, 0.3)/a, (0.7)/b, (0.9)/c\}$ and $B = \{(0.4, 0.4)/a, (0.5, 0.2)/b\}$ in
the standard form, where zero elements are ignored. We have

$$A \cup B = \{(0.5, 0.4)/a, (0.7, 0.2)/b, (0.9)/c\},$$

$$A \cap B = \{(0.4, 0.3)/a, (0.5)/b\},$$

$$A + B = \{(0.5, 0.4, 0.4, 0.3)/a, (0.7, 0.5, 0.2)/b, (0.9)/c\}.$$
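The operations of this subsection can be checked against the example with a
minimal sketch. It assumes a fuzzy multiset is stored as a Python dictionary
that maps each element of $X$ to a tuple of memberships; this representation
and the function names are illustrative assumptions, not part of the model
itself.

```python
from itertools import zip_longest

def standard(memberships):
    """Sorting operation S: drop zeros, sort in decreasing order."""
    return tuple(sorted((m for m in memberships if m > 0), reverse=True))

def pointwise(op, k, l):
    """Apply op componentwise to two standard forms, padding with zeros."""
    pairs = zip_longest(standard(k), standard(l), fillvalue=0.0)
    return standard(op(a, b) for a, b in pairs)

def combine(a, b, vector_op):
    """Apply a vector operation for every element of the universal set."""
    result = {}
    for x in sorted(set(a) | set(b)):
        value = vector_op(a.get(x, ()), b.get(x, ()))
        if value:  # omit elements whose membership sequence is all zero
            result[x] = value
    return result

def union(a, b):
    return combine(a, b, lambda k, l: pointwise(max, k, l))

def intersection(a, b):
    return combine(a, b, lambda k, l: pointwise(min, k, l))

def multiset_sum(a, b):
    # concatenation k | l followed by sorting
    return combine(a, b, lambda k, l: standard(tuple(k) + tuple(l)))

def product(a, b):
    return combine(a, b, lambda k, l: pointwise(lambda p, q: p * q, k, l))

def cardinality(a):
    return sum(sum(standard(v)) for v in a.values())

A = {"a": (0.3, 0.5), "b": (0.7,), "c": (0.9,)}
B = {"a": (0.4, 0.4), "b": (0.2, 0.5)}

print(union(A, B))
print(intersection(A, B))
print(multiset_sum(A, B))
```

Sorting before every componentwise operation is what makes the result
independent of the order in which memberships were recorded.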

¹ The operations of fuzzy multisets herein use a new notation of the sorting operation
$S(\cdot)$. Consequently, representations of the basic operations are made more compact
than those in former studies (cf. [5, 7]).
2.2 Spaces of Fuzzy Multisets

We have assumed $C_A(x)$ is finite at the beginning. However, extension to infinite
fuzzy multisets is straightforward: $C_A(x) = (\nu^1, \ldots, \nu^p, \ldots)$, in which
we can allow infinitely many nonzero elements. A reasonable assumption on this
sequence is $\nu^j \to 0$ ($j \to +\infty$).

Metric spaces are defined on the collection of fuzzy multisets of $X$. Let

$$C_A(x) = (\nu_A^1, \ldots, \nu_A^p, \ldots), \quad C_B(x) = (\nu_B^1, \ldots, \nu_B^p, \ldots).$$

Then we can define

$$d_1(A, B) = \sum_{x \in X} \sum_{j=1}^{\infty} |\nu_A^j - \nu_B^j|$$

which is an $\ell_1$ type metric. Moreover we can also define

$$d_2(A, B) = \sqrt{\sum_{x \in X} \sum_{j=1}^{\infty} |\nu_A^j - \nu_B^j|^2}$$

as the $\ell_2$ type metric. Moreover a scalar product $\langle A, B \rangle$ using the
algebraic product is introduced in the latter space:

$$\langle A, B \rangle = \sum_{x \in X} |C_{A \cdot B}(x)|.$$

We then have

$$d_2(A, B)^2 = \langle A, A \rangle + \langle B, B \rangle - 2 \langle A, B \rangle.$$

The $d_2(A, B)$ metric naturally induces a norm:

$$\|A\| = \sqrt{\langle A, A \rangle}.$$

It is not difficult to see that the metric space with $d_1$ is extended to a Banach
space and that with $\langle A, B \rangle$ to a Hilbert space, since it is straightforward
to define $cnst \cdot A$, the multiplication of $A$ by a constant $cnst$.
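As a numerical check of the relation between $d_2$ and the scalar product, the
following sketch evaluates the metrics for finite fuzzy multisets. It assumes
the same dictionary representation as an illustration (element mapped to a
tuple of memberships); the helper names are hypothetical.

```python
import math
from itertools import zip_longest

def standard(memberships):
    """Standard form: nonzero memberships in decreasing order."""
    return tuple(sorted((m for m in memberships if m > 0), reverse=True))

def aligned(a, b):
    """Yield the pairs (nu_A^j, nu_B^j) over all x and j, padded with zeros."""
    for x in sorted(set(a) | set(b)):
        yield from zip_longest(standard(a.get(x, ())),
                               standard(b.get(x, ())), fillvalue=0.0)

def d1(a, b):
    return sum(abs(p - q) for p, q in aligned(a, b))

def d2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in aligned(a, b)))

def inner(a, b):
    """Scalar product: sum over x of |C_{A.B}(x)|."""
    return sum(p * q for p, q in aligned(a, b))

def norm(a):
    return math.sqrt(inner(a, a))

A = {"a": (0.3, 0.5), "b": (0.7,), "c": (0.9,)}
B = {"a": (0.4, 0.4), "b": (0.2, 0.5)}

lhs = d2(A, B) ** 2
rhs = inner(A, A) + inner(B, B) - 2 * inner(A, B)
print(d1(A, B), lhs, rhs)
```

For the multisets $A$ and $B$ of the example in Section 2.1, both sides of the
identity evaluate to 0.91 up to rounding.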

These metrics are useful in discussing the fuzzy multiset model for data
clustering. In [8], we have demonstrated linear document clustering algorithms
based on these spaces. Nonlinear clustering algorithms are introduced in the
next section.

3 Nonlinear Clustering Algorithms

It is well known that the methods of crisp and fuzzy c-means clustering provide
linear cluster boundaries. Namely, the boundary between two clusters obtained
by these methods is linear. However, it is also known that real-world examples
require methods of separating classes having nonlinear boundaries.

In the following we distinguish two kinds of nonlinearities. The first kind is
called mild nonlinearity and the second strong nonlinearity. A typical mild
nonlinearity occurs as follows. Suppose two spherical clusters of objects are
given, one large and the other small. A c-means clustering can separate
the two clusters, but if the two clusters are close enough, a part of the larger
cluster is misclassified into the smaller cluster, since the boundary produced
by the clustering algorithm is the Voronoi boundary of the two regions
determined by the two cluster centers [4]. Such a nonlinearity can be handled
by using an additional variable controlling cluster volume sizes, which will be
discussed below.

There are, however, nonlinearities that cannot be handled in this way. For
handling such strong nonlinearities, we employ the kernel trick used in support
vector machines [10].



3.1 Objective Functions for Fuzzy c-Means 

The objects to be clustered herein may be documents, or they can be items of
information on the web. We call them documents or simply objects; they are
denoted by $X = \{x_1, \ldots, x_n\}$. A document is represented by a fuzzy multiset of
keywords. Thus a document is an element of a fuzzy multiset space. Accordingly,
a document $x_k$ is identified with the corresponding fuzzy multiset; namely, the
same symbol $x_k$ is used for both the document and the fuzzy multiset. We use
the space with the distance $d_2$, which is regarded as a Hilbert space with the
scalar product $\langle x_k, x_\ell \rangle$.

The proximity measure used for clustering is a generalization of the cosine
correlation in the fuzzy multiset space:

$$s(x_k, x_\ell) = \frac{\langle x_k, x_\ell \rangle}{\|x_k\| \, \|x_\ell\|}.$$



We use two kinds of objective functions for fuzzy c-means clustering. The
first, with the index fcm, is the standard function by Dunn and Bezdek [1], while
the second, with the index efcm, has been introduced by the authors [6, 5].

c n 

Jfam(U, V) = ~Y, Vi) 

2=1 k—1 



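$$J_{efcm}(U, V) = -\sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik}\, s(x_k, v_i) - \lambda^{-1} u_{ik} \log u_{ik} \right]$$

where the membership $u_{ik}$ of $x_k$ to cluster $i$ is subject to the constraint

$$M = \Big\{ (u_{ik}) : \sum_{i=1}^{c} u_{ik} = 1, \ \forall k; \quad u_{ik} \geq 0, \ \forall i, k \Big\}.$$

Moreover, $V = (v_1, \ldots, v_c)$ are the cluster centers.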
The iterative solutions for the clustering are obtained by alternate
minimization of these functions: $\min_{U \in M} J_{fcm}(U, V)$ while fixing $V$ to the
last minimizing element, and $\min_{V} J_{fcm}(U, V)$ while fixing $U$ to the last
minimizing element. The solutions are omitted, as they are already known [8].
In [8], metrics on fuzzy multiset spaces are discussed and algorithms for
calculating cluster centers are derived. It has been proved there that the
centers are well-defined fuzzy multisets. It should also be noted, however, that
the nonlinearities discussed below have not been considered there.



Variable for Controlling Cluster Volumes. It is convenient to consider
a generalization of the second objective function

$$J_{efcm\alpha}(U, V, \alpha) = -\sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik}\, s(x_k, v_i) - \lambda^{-1} u_{ik} \log \frac{u_{ik}}{\alpha_i} \right]$$

where $\alpha = (\alpha_1, \ldots, \alpha_c)$ is a $c$-dimensional variable for controlling
cluster volume sizes, which is subject to the constraint

$$A = \Big\{ \alpha : \sum_{i=1}^{c} \alpha_i = 1, \quad \alpha_j \geq 0, \ \forall j \Big\}.$$

The following alternate minimization with respect to the variables $\alpha$, $U$, $V$
is used.

Algorithm FCM: generalized fuzzy c-means.

FCM0. Set initial values for $\alpha$, $U$, $V$.

FCM1. Solve $\min_{\alpha \in A} J_{efcm\alpha}(U, V, \alpha)$ while fixing $U$ and $V$ to the
last minimizing elements.

FCM2. Solve $\min_{U \in M} J_{efcm\alpha}(U, V, \alpha)$ while fixing $V$ and $\alpha$ to the
last minimizing elements.

FCM3. Solve $\min_{V} J_{efcm\alpha}(U, V, \alpha)$ while fixing $U$ and $\alpha$ to the last
minimizing elements.

FCM4. If the solution $(U, V, \alpha)$ is convergent, stop; otherwise go to FCM1.
End FCM.



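The minimizing solutions are as follows.

$$\alpha_i = \frac{\sum_{k=1}^{n} u_{ik}}{n} \tag{4}$$

$$u_{ik} = \frac{\alpha_i \exp(\lambda s(x_k, v_i))}{\sum_{j=1}^{c} \alpha_j \exp(\lambda s(x_k, v_j))} \tag{5}$$

$$v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\left\| \sum_{k=1}^{n} u_{ik} x_k \right\|} \tag{6}$$

Notice that all operations are properly defined on the fuzzy multiset space.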
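The alternate minimization with the updates (4), (5) and (6) can be sketched as
follows. For brevity the sketch assumes that documents are ordinary
nonnegative vectors rather than fuzzy multisets, that a fixed number of sweeps
replaces the convergence test FCM4, and that the initial centers are supplied
by the caller; the data, the value of `lam` and the function names are
hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def memberships(xs, vs, alpha, lam):
    """Eq. (5): u_ik proportional to alpha_i * exp(lam * s(x_k, v_i))."""
    u = [[0.0] * len(xs) for _ in vs]
    for k, x in enumerate(xs):
        w = [alpha[i] * math.exp(lam * dot(x, v)) for i, v in enumerate(vs)]
        total = sum(w)
        for i in range(len(vs)):
            u[i][k] = w[i] / total
    return u

def fcm_with_volumes(data, init_centers, lam=10.0, iters=30):
    xs = [unit(x) for x in data]      # unit vectors, so s(x, v) = <x, v>
    vs = [unit(v) for v in init_centers]
    c, n, dim = len(vs), len(xs), len(xs[0])
    alpha = [1.0 / c] * c             # FCM0: initial values
    u = memberships(xs, vs, alpha, lam)
    for _ in range(iters):            # fixed sweep count instead of FCM4
        alpha = [sum(u[i]) / n for i in range(c)]        # FCM1, Eq. (4)
        u = memberships(xs, vs, alpha, lam)              # FCM2, Eq. (5)
        vs = [unit([sum(u[i][k] * xs[k][d] for k in range(n))
                    for d in range(dim)])
              for i in range(c)]                         # FCM3, Eq. (6)
    return alpha, u, vs

# Hypothetical data: a larger group near the first axis, a smaller one
# near the second axis.
data = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1], [1.0, 0.0], [0.1, 1.0], [0.2, 1.0]]
alpha, u, centers = fcm_with_volumes(data, [data[0], data[4]])
labels = [max(range(len(u)), key=lambda i: u[i][k]) for k in range(len(data))]
print(alpha, labels)
```

The variable `alpha` ends up larger for the larger group, which is how the
cluster volume variable shifts the boundary away from the plain Voronoi
boundary.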
3.2 Kernel Trick

The kernel trick is a well-known technique in support vector machines [10] by
which nonlinear classification is effectively realized. A crisp c-means
clustering algorithm using kernels has been proposed by Girolami [3]. Here we
consider kernels in this model in order to handle nonlinear clustering.

The kernel trick implies that we consider a mapping of an object $x$ into
another high-dimensional space, $\Phi(x)$. The mapping $\Phi(\cdot)$ is not explicitly
known, but the scalar product $\langle \Phi(x), \Phi(y) \rangle$ is given as a kernel function
$K(x, y)$. Here we consider the Gaussian kernel, which is most frequently used:

$$K(x_k, x_\ell) = \langle \Phi(x_k), \Phi(x_\ell) \rangle = \exp(-const \, \|x_k - x_\ell\|^2).$$

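We consider the next objective function:

$$J_{kfcm\alpha}(U, W, \alpha) = -\sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik}\, s(\Phi(x_k), w_i) - \lambda^{-1} u_{ik} \log \frac{u_{ik}}{\alpha_i} \right]$$

where $W = (w_1, \ldots, w_c)$ are the cluster centers in the high-dimensional space.

The solutions in the FCM alternate minimization algorithm are

$$\alpha_i = \frac{\sum_{k=1}^{n} u_{ik}}{n} \tag{7}$$

$$u_{ik} = \frac{\alpha_i \exp(\lambda s(\Phi(x_k), w_i))}{\sum_{j=1}^{c} \alpha_j \exp(\lambda s(\Phi(x_k), w_j))} \tag{8}$$

$$w_i = \frac{\sum_{k=1}^{n} u_{ik} \Phi(x_k)}{\left\| \sum_{k=1}^{n} u_{ik} \Phi(x_k) \right\|} \tag{9}$$

in which (7) is the same as (4). However, the solutions (8) and (9) cannot be
obtained directly, since an explicit form of $\Phi(x_k)$ is unavailable.

This problem is solved by eliminating $W$, substituting (9) into (8):

$$s(\Phi(x_k), w_i) = \frac{\sum_{j=1}^{n} u_{ij} K(x_j, x_k)}{\sqrt{\sum_{j=1}^{n} \sum_{\ell=1}^{n} u_{ij} u_{i\ell} K_{j\ell}}} \tag{10}$$

where $K_{j\ell} = K(x_j, x_\ell)$; recall that $K(x_k, x_k) = 1$ for the Gaussian kernel,
so that $\|\Phi(x_k)\| = 1$.

Thus, by repeating (7) and (8), we obtain an iterative solution for $u_{ik}$ and
$\alpha_i$. Notice that (10) should be used in calculating $u_{ik}$ by (8).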
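The kernelized iteration, in which cluster centers never appear explicitly, can
be sketched as follows. The sketch assumes plain numeric vectors as objects, a
Gaussian kernel with `const = 1`, a hypothetical initial membership matrix, and
a fixed number of sweeps; none of these values come from the experiments.

```python
import math

def gaussian_kernel(x, y, const=1.0):
    return math.exp(-const * sum((a - b) ** 2 for a, b in zip(x, y)))

def gram(points, const=1.0):
    return [[gaussian_kernel(x, y, const) for y in points] for x in points]

def kernel_similarity(u_i, K, k):
    """Eq. (10): s(Phi(x_k), w_i) from the memberships u_i of cluster i."""
    n = len(u_i)
    num = sum(u_i[j] * K[j][k] for j in range(n))
    den = math.sqrt(sum(u_i[j] * u_i[l] * K[j][l]
                        for j in range(n) for l in range(n)))
    return num / den

def update(u, K, lam):
    """One sweep of Eq. (7) followed by Eq. (8), using Eq. (10) for s."""
    c, n = len(u), len(K)
    alpha = [sum(u[i]) / n for i in range(c)]                  # Eq. (7)
    s = [[kernel_similarity(u[i], K, k) for k in range(n)] for i in range(c)]
    new_u = [[0.0] * n for _ in range(c)]
    for k in range(n):
        w = [alpha[i] * math.exp(lam * s[i][k]) for i in range(c)]
        total = sum(w)
        for i in range(c):
            new_u[i][k] = w[i] / total                         # Eq. (8)
    return alpha, new_u

# Hypothetical objects forming two well-separated pairs.
points = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.1, 0.0]]
K = gram(points)
u = [[0.9, 0.6, 0.4, 0.1], [0.1, 0.4, 0.6, 0.9]]   # assumed initial U
for _ in range(10):
    alpha, u = update(u, K, lam=5.0)
labels = [max(range(len(u)), key=lambda i: u[i][k]) for k in range(len(points))]
print(labels)
```

Only the Gram matrix `K` is needed, which is why the method also applies when
the objects are fuzzy multisets and the kernel is computed from $d_2$.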



3.3 Crisp c-Means 

A crisp c-means clustering algorithm can easily be derived by modifying
$J_{efcm\alpha}$. We consider the next objective function:

$$J_{ccm\alpha}(U, V, \alpha) = -\sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik}\, s(x_k, v_i) + \lambda^{-1} u_{ik} \log \alpha_i \right]$$

This objective function is obtained by eliminating the term $u_{ik} \log u_{ik}$ from
$J_{efcm\alpha}$.

We see that $\min_{U \in M} J_{ccm\alpha}$ leads to a crisp solution, since the function is
linear with respect to $u_{ik}$. Thus, calculation of $U$ is reduced to

$$u_{ik} = 1 \Longleftarrow i = \arg\max_{1 \leq j \leq c} s(x_k, v_j), \tag{11}$$

$$u_{ik} = 0 \Longleftarrow i \neq \arg\max_{1 \leq j \leq c} s(x_k, v_j). \tag{12}$$

The algorithm with the kernel uses (7), (10), and (11), (12).



3.4 Clustering by Competitive Learning 

Clustering by competitive learning is also a standard technique of unsupervised
classification [2]. A basic competitive learning algorithm is as follows.

Algorithm CCL: clustering by competitive learning.

Step 1. Randomly select cluster centers $v_i$, $i = 1, \ldots, c$.

Normalize $x_k$:

$$x_k \leftarrow x_k / \|x_k\|, \quad k = 1, \ldots, n.$$

Set $t = 0$.

Step 2. Repeat Step 2.1 and Step 2.2 until the solution is convergent.

Step 2.1. Allocate $x_k$ to the cluster $i$:

$$i = \arg\max_{1 \leq j \leq c} \langle x_k, v_j \rangle.$$

Step 2.2. Update the cluster center:

$$v_i \leftarrow v_i + \eta(t) x_k \quad \text{and} \quad v_i \leftarrow v_i / \|v_i\|.$$

Let $t \leftarrow t + 1$.

End CCL.
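Algorithm CCL can be sketched as follows for ordinary vectors. Several details
are assumptions made only for the illustration: the initial centers are passed
in explicitly so that the run is reproducible (Step 1 selects them randomly),
the learning rate $\eta(t)$ decreases as `eta0 / (1 + t)`, and a fixed number
of sweeps over the data replaces the convergence test.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def ccl(data, init_centers, epochs=10, eta0=0.5):
    xs = [unit(x) for x in data]              # Step 1: normalize x_k
    vs = [unit(v) for v in init_centers]
    t = 0
    for _ in range(epochs):                   # Step 2, fixed number of sweeps
        for x in xs:
            # Step 2.1: allocate x to the cluster with the largest <x, v_j>
            i = max(range(len(vs)), key=lambda j: dot(x, vs[j]))
            eta = eta0 / (1.0 + t)            # assumed schedule for eta(t)
            # Step 2.2: move the winning center toward x and renormalize
            vs[i] = unit([v + eta * a for v, a in zip(vs[i], x)])
            t += 1
    labels = [max(range(len(vs)), key=lambda j: dot(x, vs[j])) for x in xs]
    return vs, labels

# Hypothetical data with two directions.
data = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1], [1.0, 0.0], [0.1, 1.0], [0.2, 1.0]]
centers, labels = ccl(data, [data[0], data[4]])
print(labels)
```

Only the winning center is updated at each step, which is what distinguishes
competitive learning from the batch updates of c-means.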

This algorithm can be applied directly to document clustering. To handle
the strong nonlinearity, we consider the use of kernels in this algorithm. We use
$\Phi(x_k)$ instead of $x_k$; hence the normalization is

$$y_k \leftarrow \Phi(x_k) / \|\Phi(x_k)\|$$

and the allocation rule is

$$i = \arg\max_{1 \leq j \leq c} \langle y_k, v_j \rangle.$$

Since we do not use $\Phi(x_k)$ explicitly, $\langle y_k, v_i \rangle$ has to be represented by
the kernel.

Let

$$p(x_k, i; t) = \langle y_k, v_i \rangle$$

be the value of the scalar product at time $t$. From the updating equations

$$v_i \leftarrow v_i + \eta(t) y_k, \qquad v_i \leftarrow v_i / \|v_i\|,$$



we note 






\vi + 'q{t)yk\\ 



Put Vi{t) = 1 1 fill and note that 



< Vi,yk >= 



K 



ik 



where Kjk = K{xj,Xk)- We then have 

(i + 1) = (t) + ‘irj{t)p{xk + (t) 



p{xj,i;t) + y{t ) — ^ 



p{xj,i]t + l) = 



Kick 



v,{t + l) 



(13) 

(14) 



These equations are used instead of the algorithm CCL. The initial values for $v_i$
should be selected from $y_j$, $j = 1, \ldots, n$. Then the initial calculation of
$V_i(t)$ and $p(x_j, i; t)$ for $t = 0$ is straightforward.



4 A Numerical Example 

We used a document database compiled by the Japan Society for Fuzzy Theory and
Systems, which includes titles, keywords, abstracts, etc. of papers presented at
the Fuzzy System Symposia in Japan. Forty documents were selected, of which 20
discuss neural networks and the other 20 study image processing.

Five keywords, namely neural network, fuzzy, image, model, and data, were used,
and memberships were attached by the following rules:

1. If a keyword is in the title, its membership is 1.0.

2. If a keyword is in the keyword list, its membership is 0.5.

3. If a keyword is in the abstract, its membership is 0.2.

Since we use fuzzy multisets, if a keyword, say fuzzy, occurs in the title, keyword
list, and abstract, then the membership is {1.0, 0.5, 0.2}.
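The three rules can be turned into a small routine that builds the fuzzy
multiset of a document. The sketch assumes a document is a dictionary with the
hypothetical fields `title`, `keywords` and `abstract`, and it uses simple
case-insensitive substring matching, which is a simplification; the sample
document is invented for illustration.

```python
# Membership assigned to a keyword occurrence in each part of a document.
WEIGHTS = {"title": 1.0, "keywords": 0.5, "abstract": 0.2}

def fuzzy_multiset(doc, vocabulary):
    """Map each keyword to the tuple of memberships of its occurrences."""
    result = {}
    for word in vocabulary:
        grades = [w for field, w in WEIGHTS.items()
                  if word in doc.get(field, "").lower()]
        if grades:
            result[word] = tuple(sorted(grades, reverse=True))
    return result

vocabulary = ["neural network", "fuzzy", "image", "model", "data"]

# Hypothetical document, not taken from the database used in the paper.
doc = {
    "title": "Fuzzy image segmentation",
    "keywords": "fuzzy, clustering",
    "abstract": "We propose a fuzzy model for image data.",
}

ms = fuzzy_multiset(doc, vocabulary)
print(ms)
```

The keyword fuzzy occurs in all three parts and therefore receives the
membership sequence (1.0, 0.5, 0.2), already in standard form.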

Figures 1 and 2 show the results of clustering using $J_{efcm\alpha}$ and $J_{kfcm\alpha}$,
respectively, where the number of clusters is $c = 2$. Note that $J_{efcm\alpha}$
is without a kernel, while $J_{kfcm\alpha}$ is with the Gaussian kernel. Both are
entropy-based methods of fuzzy c-means. The symbols □ and × represent
documents of 'image processing' and 'neural networks', respectively.

Since we use $c = 2$, the lower half of a figure should be one cluster while the
upper half should be the other. Thus, we have five misclassifications in
Figure 1, while no misclassifications are found in Figure 2.

Results by other methods are omitted to save space, but these figures
already show the effectiveness of the kernel trick.
Fig. 1. Result from $J_{efcm\alpha}$ without a kernel. The horizontal axis is the norm of the
fuzzy multisets and the vertical axis is the membership value for a cluster

Fig. 2. Result from $J_{kfcm\alpha}$ with the Gaussian kernel

5 Conclusion 

The advance of information systems requires new information retrieval models. In
this paper we have shown a fuzzy multiset model suited for information retrieval
systems on the WWW. Moreover, nonlinearities in unsupervised automatic
classification have been dealt with by using an additional variable for
controlling cluster volume sizes and by employing the kernel trick used in
support vector machines. The numerical example shows the effectiveness of the
kernel-based method.

Future studies include simplification of the model, reduction of computation,
and testing and comparison of these methods on larger sets of document data.
Abstract. In using a classified data set to test clustering algorithms,
the data points in a class are considered as one cluster (or more than
one) in space. In this paper we adopt this principle to build classification
models through interactively clustering a training data set to construct
a tree of clusters. The leaf clusters of the tree are selected as decision
clusters to classify new data based on a distance function. We consider
the feature weights in calculating the distances between a new object and
the center of a decision cluster. The new algorithm, W-k-means, is used
to automatically calculate the feature weights from the training data.
The Fastmap technique is used to handle outliers in selecting decision
clusters. This step increases the stability of the classifier. Experimental
results on public domain data sets have shown that the models built
using this clustering approach outperformed some popular classification
algorithms.

Keywords: DCC, classification, clustering, data mining, feature weight 



1 Introduction 

In this paper, we present a feature weighting approach to building classification
models through interactively clustering a training data set to construct a tree of
clusters. The leaf clusters of the tree are selected as decision clusters to classify
new data. The set of decision cluster centers, together with the labels of the
dominant classes in the decision clusters and a distance function, forms the
classification model, called DCC (decision clusters classifier). When a DCC model
is used to classify a new data object, the distances between the object and the
centers of the decision clusters are computed. The decision cluster with the
shortest distance to the object is selected, and the dominant class of this
decision cluster is assigned as the class of the object. The percentage of the
dominant class in the decision cluster is the confidence of the classification
of the new object.
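The classification step can be sketched as follows. The text only states that a
distance function with feature weights is used, so the weighted Euclidean
distance below is an assumption, as are the data layout (a list of dictionaries
with `center`, `label` and `confidence`) and all numeric values.

```python
import math

def weighted_distance(x, center, weights):
    """Assumed distance: weighted Euclidean distance over the features."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x, center)))

def classify(x, decision_clusters, weights):
    """Return the dominant class and confidence of the nearest decision cluster."""
    best = min(decision_clusters,
               key=lambda cl: weighted_distance(x, cl["center"], weights))
    return best["label"], best["confidence"]

# Hypothetical DCC model with two decision clusters.
model = [
    {"center": [0.0, 5.0], "label": "A", "confidence": 0.9},
    {"center": [1.2, 0.0], "label": "B", "confidence": 0.8},
]

x = [1.0, 5.0]
equal = classify(x, model, [0.5, 0.5])     # both features count equally
first_only = classify(x, model, [1.0, 0.0])  # second feature is ignored
print(equal, first_only)
```

The same object is assigned to different decision clusters under the two
weight vectors, which illustrates why the feature weights matter for the
classification decision.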

In deciding the decision cluster for a new object, the weights of the features
describing the object are considered. The features that contribute more
to the formation of the decision clusters have higher weights in the distance
function than other features. The feature weights are automatically calculated
in the cluster tree building process using the new W-k-means algorithm [1] that
we recently developed. The feature weighting approach is able to treat features
differently in making classification decisions. It is well known that clusters are
often formed in a subspace defined by a subset of features. Using feature weights,
we address the problem of subspace clusters.

In selecting decision clusters, we also consider outliers. An outlier is a small
cluster that has an even distribution of classes, i.e., no significant dominant
class. These clusters occur at the boundaries of other clusters which have clear
dominant classes. Removal of these outliers can increase the accuracy and
stability of the DCC models. In this approach we use the Fastmap technique to
visually verify the outlier clusters [2].

We have implemented a prototype system, called VC+, in Java to facilitate
the interactive process of building DCC models. We conducted a series of
experiments on public domain data sets from the UCI Machine Learning data
repository [3]. The results have shown that the DCC models outperformed some
popular classifiers. We also experimented with feature reduction based on
weights and observed an increase in classification accuracy after insignificant
features were removed.

The DCC model is very similar to the KNN model, but their model building
processes are different. The DCC model is more efficient since it uses the centers
of decision clusters rather than individual records in the training data set. In [4],
Mui and Fu presented a binary tree classifier for classification of nucleated blood
cells. Each terminal node of the tree is a cluster dominated by a particular group
of blood cell classes. This work was later advanced by the use of the k-means
algorithm to generate clusters at each non-terminal node and determine the
grouping of classes [5].

Although the study of algorithms for building classification models has
focused on the automatic approach, the interactive approach has recently been
brought to attention again [6] with the enhancement of sophisticated
visualization techniques. The great advantage of the interactive approach is
that human knowledge can be used to guide the model building process. Ankerst
et al. construct the classification model based on the ID3 and C4.5 algorithms
for tree growing [7] [8]. In general, decision trees represent a disjunction
of conjunctions of constraints on the feature values of instances, while our
cluster tree considers all features but weights them according to their
importance in clustering.

The paper is organized as follows. In Section 2, we describe the interactive
approach to building DCC models. In Section 3, the W-k-means algorithm is
introduced. In Section 4, some experimental results on several well-known
data sets are given to show that the DCC models outperformed other popular
classifiers. Finally, we conclude this paper in Section 5.
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2 Construction of DCC Models 

In this section, we describe a decision clusters classifier (DCC) for data mining. 
A DCC model is defined as a set of p decision clusters generated with a clustering 
algorithm from a training data set. A decision cluster is labelled by one of the 
classes in data, called dominant class. The DCC model classifies new objects 
by deciding which decision clusters these objects belong to. A DCC model is 
extracted from a tree of clusters built from the training data set. 

Building a cluster tree from a training data set amounts to finding a
sequence of nested clusters in the data set. We use a top-down approach that
interactively alternates clustering and cluster validation to construct a
tree of clusters, as shown in Fig. 1. Starting with the root node C, which
represents the entire training data set, we use a clustering algorithm to
divide it into three clusters, {C1, C2, C3}. We then use the target feature
to validate each cluster by computing the distribution of classes and finding
the most frequent class. If the frequency of the most frequent class is
greater than a given threshold, we assign this class to the cluster as the
dominant class; cluster C3 in Fig. 1 is an example. Otherwise, we further
partition the cluster into sub-clusters such as C11 and C12. If the size of a
cluster is smaller than a given threshold, we stop clustering it further and
do not assign a dominant class to it, as with C2. Such clusters will not be
selected as decision clusters.
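
The validation rule described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name, the
status strings and the parameter names `purity_threshold` and `min_size` are
hypothetical, and the order of the two tests (frequency first, then size) is
one reading of the text rather than a documented detail of the system.

```python
from collections import Counter


def validate_cluster(labels, purity_threshold=0.8, min_size=10):
    """Decide what to do with one cluster of the tree.

    labels           -- class labels of the training objects in the cluster
    purity_threshold -- hypothetical name for the class-frequency threshold
    min_size         -- hypothetical name for the cluster-size threshold
    Returns (status, dominant_class), where status is one of
    "labelled", "dropped" or "split".
    """
    if not labels:
        # An empty cluster cannot have a dominant class.
        return "dropped", None
    top_class, top_count = Counter(labels).most_common(1)[0]
    if top_count / len(labels) > purity_threshold:
        # The most frequent class is frequent enough: it becomes dominant.
        return "labelled", top_class
    if len(labels) < min_size:
        # Too small to partition further: leave it without a dominant class.
        return "dropped", None
    # No clear dominant class and enough objects: partition further.
    return "split", None
```

A cluster returned as "dropped" corresponds to a leaf such as C2, which is
kept in the tree but not selected as a decision cluster.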




Fig. 1. A cluster tree created by interactive clustering and cluster
validation. The symbols ^ and * show the dominant classes of leaf clusters.
Some leaf clusters do not have a dominant class due to the even distribution
of classes



Deciding whether to further partition a node into sub-clusters is equivalent
to deciding the terminal nodes and the best split in decision trees. In fact,
our cluster tree is a kind of decision tree, although we do not use it to
make classification decisions. We determine a cluster to be a terminal node
based on two conditions: (1) its objects are dominated by one class and (2)
it is a natural cluster in the object space. Condition (1), which is widely
used in many decision tree algorithms, is evaluated from the frequencies of
classes in the cluster. If no clear dominant class exists, the cluster will
be further partitioned into sub-clusters.

If a cluster with a dominant class is found, we do not simply declare it a
terminal node. Instead, we investigate whether the cluster is a natural one
by looking into its compactness and isolation [9]. To do so, we adopt the
Fastmap algorithm [10] to project the objects in the cluster onto a
2-dimensional (2D) space. Given a cluster, the 2D projection allows us to
visually identify whether sub-clusters exist in it. If we see any separate
clusters in the 2D projection, we can conclude that sub-clusters exist in the
original object space and use the clustering algorithm to find them. However,
if there are no separate clusters on the display, we do not simply conclude
that the cluster (e.g. C2) is a natural cluster. Instead, we visualize the
distribution of the distances between the objects and the cluster center.
This visual information further tells us how compact the cluster is. We also
take advantage of the Fastmap projection to assist the selection of k for the
k-means algorithm. By projecting the objects in the cluster onto a 2D space
and visualizing objects of different classes in different colors or symbols,
we can examine the potential number of clusters and the distribution of
object classes in different clusters. Therefore, in determining k, we
consider not only the number of potential clusters but also the number and
distribution of classes in the cluster.

Let X denote the training data set, Θ the W-k-means algorithm and F the
Fastmap algorithm. We summarize the interactive process to build a cluster
tree as follows.

1. Begin: Set X as the root of the cluster tree. Select the root as the
current node Sc;

2. Use F to project Sc onto 2D. Visually examine the projection to decide k,
the number of potential clusters;

3. Apply Θ to partition Sc into k clusters;

4. Use F and other visual methods to validate the partition. (The test data
set can also be used here to test whether the new clustering increases the
classification accuracy.);

5. If the partition is accepted, go to step 6; otherwise, select a new k and
go to step 3;

6. Attach the clusters as the children of the partitioned node. Select one as
the current node Sc;

7. Validate Sc to determine whether it is a terminal node or not;

8. If it is not a terminal node, go to step 2. If it is a terminal node, but
not the last one, select another node that has not been validated as the
current node Sc and go to step 7. If it is the last terminal node in the
tree, stop.

After we build a cluster tree from the training data set using this process,
we have created a sequence of clusterings. In principle, each clustering is a
DCC model, and their classification performances differ. Therefore, we use a
test data set to identify the best DCC model from a cluster tree. We start
from the top-level clustering. First, we select all clusters of the top-level
clustering as decision clusters, use them to classify the test data set and
calculate the classification accuracy. Then we identify the decision clusters
that have misclassified more objects than the other clusters. We replace
these clusters with their sub-clusters in the lower-level clustering and test
the model again. We continue this process until the best DCC model is found.
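
One evaluation round of this selection procedure might look like the
following sketch. It assumes that an object belongs to the decision cluster
with the nearest center and that distance is the plain squared Euclidean
distance on numeric features; both are assumptions of this illustration, and
all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(obj, decision_clusters):
    """Index of the decision cluster whose center is closest to obj.

    decision_clusters -- list of (center, dominant_class) pairs
    """
    return min(range(len(decision_clusters)),
               key=lambda i: sq_dist(obj, decision_clusters[i][0]))


def classify(obj, decision_clusters):
    """Give obj the dominant class of its nearest decision cluster."""
    return decision_clusters[nearest_cluster(obj, decision_clusters)][1]


def evaluate(test_objects, test_labels, decision_clusters):
    """Return (accuracy, errors), with errors[i] counting the test objects
    that decision cluster i classified wrongly."""
    errors = [0] * len(decision_clusters)
    for obj, label in zip(test_objects, test_labels):
        i = nearest_cluster(obj, decision_clusters)
        if decision_clusters[i][1] != label:
            errors[i] += 1
    accuracy = 1.0 - sum(errors) / len(test_labels)
    return accuracy, errors
```

The clusters with the largest entries in `errors` are the candidates to be
replaced by their sub-clusters before the model is tested again.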

Each level of clustering in the cluster tree is a partition of the training
data set. However, our final DCC model is not necessarily a partition. In the
final DCC model, we often drop certain clusters from a clustering. For
example, some leaves (e.g. C2) in Fig. 1 do not have class symbols. These
clusters contain a few objects from several classes; such objects are located
at the boundaries of other clusters. In our experiments, we found that
dropping these clusters from the model can increase the classification
accuracy.

3 Feature Weighting 

In growing the cluster tree, we combine the process with the feature weights
calculated by the W-k-means algorithm. A major problem of using the basic
k-means type algorithms in data mining is the selection of features. The
k-means type algorithms cannot select features automatically because they
treat all features equally in the clustering process. However, it is well
known that an interesting clustering structure usually occurs in a subspace
defined by a subset of the initially selected features. To find the
clustering structure, it is important to identify this subset of features.

The W-k-means algorithm calculates feature weights automatically. Based on
the current partition in the iterative k-means clustering process, the
algorithm calculates a new weight for each feature according to the variance
of the within-cluster distances. The new weights are used in deciding the
cluster membership of objects in the next iteration. The feature weights
measure the importance of features in clustering. Small weights reduce or
eliminate the effect of insignificant (or noisy) features. The weights are
thus effective in identifying clusters that lie in the subspace spanned by
the subset of features with large weights.

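The W-k-means algorithm is briefly described as follows. The new algorithm is
motivated by fuzzy k-means clustering algorithms [11]. Let
$X = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ objects, where
$X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$ is described by $m$ features. The
W-k-means algorithm is formulated as the following minimization problem P:

$$P(U, Z, W) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} u_{i,l}\, w_j^{\beta}\, d(x_{i,j}, z_{l,j}) \qquad (1)$$

subject to

$$\begin{cases}
\sum_{l=1}^{k} u_{i,l} = 1, & 1 \le i \le n \\
u_{i,l} \in \{0, 1\}, & 1 \le i \le n,\ 1 \le l \le k \\
\sum_{j=1}^{m} w_j = 1
\end{cases} \qquad (2)$$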

where $U$ is an $n \times k$ partition matrix, $Z = \{Z_1, Z_2, \ldots, Z_k\}$
is a set of $k$ vectors representing the centers of the clusters,
$W = [w_1, w_2, \ldots, w_m]$ is the weight vector, $\beta > 1$, and
$d(\cdot, \cdot)$ is the distance between two objects. If the feature is
numeric, then $d(x_{i,j}, z_{l,j}) = (x_{i,j} - z_{l,j})^2$. If the feature is
categorical, then

$$d(x_{i,j}, z_{l,j}) = \begin{cases}
0 & (x_{i,j} = z_{l,j}) \\
1 & (x_{i,j} \ne z_{l,j})
\end{cases}$$
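
To make the objective concrete, the sketch below evaluates P(U, Z, W) for a
small data set with mixed features. The partition matrix U is stored
compactly as a list `assign` holding the cluster index of each object, which
is equivalent under the constraints in (2). The function and parameter names
are hypothetical and chosen only for this illustration.

```python
def feature_distance(x, z, categorical):
    """d(x, z) on one feature: squared difference if the feature is numeric,
    0/1 mismatch if it is categorical."""
    if categorical:
        return 0.0 if x == z else 1.0
    return (x - z) ** 2


def objective(X, assign, Z, W, beta, is_categorical):
    """Value of P(U, Z, W).

    X              -- list of n objects, each a sequence of m feature values
    assign         -- assign[i] is the index l of the cluster of object i
    Z              -- list of k cluster centers, each of length m
    W              -- list of m feature weights
    beta           -- exponent of the weights (beta > 1)
    is_categorical -- list of m booleans, True for categorical features
    """
    total = 0.0
    for x_i, l in zip(X, assign):
        for j, x_ij in enumerate(x_i):
            d = feature_distance(x_ij, Z[l][j], is_categorical[j])
            total += (W[j] ** beta) * d
    return total
```

For instance, with the two objects (1.0, "a") and (3.0, "b") assigned to a
single center (2.0, "a"), equal weights 0.5 and β = 2, the three non-zero
terms each contribute 0.25, so the objective is 0.75.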

Problem P can be solved by iteratively solving the following three
minimization problems:

1. Problem $P_1$: Fix $Z = \hat{Z}$ and $W = \hat{W}$, solve the reduced
problem $P(U, \hat{Z}, \hat{W})$;

2. Problem $P_2$: Fix $U = \hat{U}$ and $W = \hat{W}$, solve the reduced
problem $P(\hat{U}, Z, \hat{W})$;

3. Problem $P_3$: Fix $U = \hat{U}$ and $Z = \hat{Z}$, solve the reduced
problem $P(\hat{U}, \hat{Z}, W)$.

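Problem $P_1$ is solved by

$$\begin{cases}
u_{i,l} = 1 & \text{if } \sum_{j=1}^{m} w_j^{\beta}\, d(x_{i,j}, z_{l,j}) \le \sum_{j=1}^{m} w_j^{\beta}\, d(x_{i,j}, z_{t,j}) \text{ for } 1 \le t \le k \\
u_{i,t} = 0 & \text{for } t \ne l
\end{cases} \qquad (3)$$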

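Problem $P_2$ is solved by

$$z_{l,j} = \frac{\sum_{i=1}^{n} u_{i,l}\, x_{i,j}}{\sum_{i=1}^{n} u_{i,l}} \quad \text{for } 1 \le l \le k \text{ and } 1 \le j \le m \qquad (4)$$

if the feature is numeric. If the feature is categorical, then
$z_{l,j} = a_j$, where $a_j$ is the mode of the feature values in cluster $l$.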



And problem $P_3$ is solved by

$$\hat{w}_j = \begin{cases}
0 & \text{if } D_j = 0 \\
\dfrac{1}{\sum_{t=1}^{h} \left[ D_j / D_t \right]^{1/(\beta - 1)}} & \text{if } D_j \ne 0
\end{cases} \qquad (5)$$

where $D_j = \sum_{l=1}^{k} \sum_{i=1}^{n} \hat{u}_{i,l}\, d(x_{i,j}, z_{l,j})$,
and $h$ is the number of features where $D_j \ne 0$. See [1] for the detailed
proof. The optimal clustering results are obtained as the process converges.
Most importantly, a set of feature weights is produced automatically in the
clustering process, so we can use them to select important features in
clustering and building classification models.
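
The three alternating steps can be put together in a short sketch. It is
restricted to numeric features, starts from uniform weights, and stops when
the assignments no longer change; these choices, the handling of empty
clusters and of the all-zero dispersion case, and every function name are
assumptions of this illustration rather than details given in the text.

```python
def update_weights(D, beta):
    """Step P3: new feature weights from the per-feature dispersions D_j."""
    nonzero = [d for d in D if d > 0]
    if not nonzero:
        # Degenerate case (every D_j is zero). Uniform weights are an
        # assumption of this sketch; the text does not cover this case.
        return [1.0 / len(D)] * len(D)
    exponent = 1.0 / (beta - 1.0)
    return [
        0.0 if d == 0 else 1.0 / sum((d / t) ** exponent for t in nonzero)
        for d in D
    ]


def wkmeans(X, initial_centers, beta=2.0, max_iter=100):
    """Sketch of W-k-means for numeric features; returns (assign, Z, W)."""
    m, k = len(X[0]), len(initial_centers)
    Z = [list(z) for z in initial_centers]
    W = [1.0 / m] * m
    assign = None
    for _ in range(max_iter):
        # Step P1: assign each object to the center with the smallest
        # weighted distance.
        new_assign = [
            min(range(k), key=lambda l: sum(
                (W[j] ** beta) * (x[j] - Z[l][j]) ** 2 for j in range(m)))
            for x in X
        ]
        if new_assign == assign:
            break
        assign = new_assign
        # Step P2: move each center to the mean of its members.
        for l in range(k):
            members = [x for x, a in zip(X, assign) if a == l]
            if members:  # an empty cluster keeps its previous center
                Z[l] = [sum(col) / len(members) for col in zip(*members)]
        # Step P3: recompute the weights from the dispersions D_j.
        D = [sum((x[j] - Z[a][j]) ** 2 for x, a in zip(X, assign))
             for j in range(m)]
        W = update_weights(D, beta)
    return assign, Z, W
```

On data where one feature separates the clusters tightly and another is
spread widely within each cluster, the first feature ends up with nearly all
of the weight, which is the behaviour the section describes.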



4 Experiments 

We have implemented a prototype system, called PC+, in Java to facilitate the 
interactive process to build DCC models. In this section, we use experimental 
results on real data sets to demonstrate the classification performance of the 
DCC models in comparison with other popular classifiers. We also show that 
feature selection based on weights improves the classification accuracy. 

4.1 Experiment Data Sets 

In our experiments, we tested our DCC models on four public data sets chosen
from the UCI Machine Learning data repository [3] and compared our results
with those of Quinlan's C5.0 decision tree algorithm, Discrim (a statistical
classifier developed by R. Henery), Bayes (a statistical classifier that is
part of the IND package from NASA's COSMIC center) and KNN (a statistical
classifier developed by C. Taylor). The characteristics of the four data sets
are listed in Table 1. Size, data complexity and classification difficulty
were the major considerations in choosing these data sets. The Heart and
Credit Card data sets contain both numerical and categorical attributes. The
Heart and Diabetes data sets are among those that are difficult to classify
(low classification accuracy) [3].

Table 1. Four data sets from the UCI machine learning data repository

Data Sets        Training   Test       Numerical  Categorical  No. of
                 Instances  Instances  Features   Features     Classes
Heart            189        81         7          6            2
Credit Card      440        213        6          9            2
Diabetes         537        230        8          0            2
Satellite Image  4435       2000       36         0            6


4.2 Creating DCC Models 

We used a top-down approach, interactively alternating clustering and cluster
validation, to build a cluster tree following the process steps given in
Section 2. Fig. 2 shows a tree generated from the Credit Card data set.




Fig. 2. A cluster tree representing DCC models

Fig. 3. FastMap view for a cluster without class label



At first, we created node S as the root of this tree and used the Fastmap
algorithm to project it onto a 2D space. According to the 2D display, we
decided to create four clusters. We combined the clustering algorithm with
the feature weights obtained by the W-k-means algorithm to partition node S
into four clusters. Then, we calculated the distribution of the classes for
every cluster. If the frequency of the most frequent class was greater than
the given threshold θ (here, θ = 80%), we assigned this class to the cluster
as the dominant class. Because none of the clusters {A1, A2, A3, A4}
satisfied this threshold, we further clustered them by repeatedly using the
clustering and Fastmap algorithms.

Fig. 3 shows the Fastmap view for node A1. The cluster center is marked as
the centroid. We found that node A1 contained few instances, from a mixture
of classes, so no dominant class could be identified. We could further
cluster this node into sub-clusters, but the sub-clusters would be very
small. These small clusters, if selected as decision clusters, could cause an
over-fitting problem. To avoid this, we stopped further clustering and did
not assign a dominant class to the node.

The nodes A2 and A4 in Fig. 2 were clustered into sub-clusters, each of which
was assigned a dominant class. Using Fastmap views (similar to Fig. 3), we
verified that these sub-clusters were compact, so we made them terminal
nodes. We used dark and light colors to represent the Yes and No classes
respectively.

For node A3, we repeated the same clustering process to partition it into
children and grandchildren clusters. We used the Fastmap view to identify
outliers in further partitions. Fig. 4 shows an outlier marked as O. It
contained only a few instances and did not have a dominant class. We
considered it to be located at the boundaries of other clusters, as shown in
Fig. 5. If such a cluster was selected as a decision cluster, an object
closest to it would likely be wrongly classified. If we did not select it as
a decision cluster, the objects around it would be classified by its
neighboring decision clusters, which were more stable because of their clear
dominant classes.




Fig. 4. Outlier shown in the FastMap View 




Continuing the above steps, we interactively built a whole cluster tree from
which we could identify the DCC models. To test the performance of different
DCC models, we used a test data set. For every DCC model, we calculated the
classification accuracy on the test data set and selected the DCC model with
the highest accuracy.

4.3 Weighting and Selecting Features 

In the experiments, we used the W-k-means algorithm to calculate feature
weights automatically. Fig. 6 shows the distribution of the feature weights
of one clustering result from the Credit Card data set. The weight values of
the features are given in Table 2.




Fig. 6. Distribution of feature weights from one clustering result of the
Credit Card data set (horizontal axis: feature index)

Fig. 7. The relationship between classification accuracy and the number of
the removed features obtained from the Credit Card data set



Table 2. The weights of the features in the Credit Card data set

Feature  Weight   Feature  Weight   Feature  Weight   Feature  Weight   Feature  Weight
F1       0.0130   F4       0.0167   F7       0.0093   F10      0.0139   F13      0.0167
F2       0.1652   F5       0.0167   F8       0.5167   F11      0.0088   F14      0.0044
F3       0.1871   F6       0.0044   F9       0.0167   F12      0.0083   F15      0.0021


According to the salience of the features, we removed features with small
weights step by step and built the DCC models on the subset of the remaining
features. Fig. 7 shows the relationship between the classification accuracy
and the number of removed features for the Credit Card data set. The
horizontal axis represents the number of removed features, while the vertical
axis represents the classification accuracy.

Fig. 7 indicates that the performance of the DCC model can be improved by
removing some less important features, i.e., the features with small weights,
from the data set. However, the number of features to be removed has to be
well controlled. From Fig. 7 we can see that the classification accuracy
increased when the features with the lowest and the second lowest weights
were removed. This may indicate that instead of contributing to the
classification model, these two features degraded its performance. Removing
them decreased the impact of these less important variables on the model,
reduced the dimensionality of the problem and increased the stability of the
model. However, removing further features was counterproductive, i.e., it
reduced the accuracy of the model. This implies that too much useful
information was removed and that the subspace of the remaining features could
no longer provide a clear separation of the classes in the DCC model.

The number of features that can be removed differs from one data set to
another. The optimal number can be found by experiment. Using the feature
weights to decide which features to remove can significantly reduce the
number of experiments needed to find the optimal number.
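
The stepwise removal can be sketched as follows, using the Credit Card
weights of Table 2 with the feature labels read as F1 to F15. The function
names are hypothetical, and `evaluate` stands for any caller-supplied routine
that builds a DCC model on the given features and returns its test accuracy;
no such routine is specified in the text.

```python
# Feature weights from Table 2 (Credit Card data set).
weights = {
    "F1": 0.0130, "F2": 0.1652, "F3": 0.1871, "F4": 0.0167, "F5": 0.0167,
    "F6": 0.0044, "F7": 0.0093, "F8": 0.5167, "F9": 0.0167, "F10": 0.0139,
    "F11": 0.0088, "F12": 0.0083, "F13": 0.0167, "F14": 0.0044, "F15": 0.0021,
}


def removal_order(weights):
    """Features sorted from smallest to largest weight.
    Ties keep the order in which the features were listed."""
    return sorted(weights, key=weights.get)


def remaining_features(weights, n_removed):
    """Features left after dropping the n_removed lowest-weight features."""
    dropped = set(removal_order(weights)[:n_removed])
    return [f for f in weights if f not in dropped]


def best_removal_count(weights, evaluate, max_removed):
    """Try removing 0..max_removed features and keep the best count.

    evaluate -- hypothetical callback: list of feature names -> accuracy
    Returns (best_count, scores), where scores maps each count to its
    accuracy.
    """
    scores = {r: evaluate(remaining_features(weights, r))
              for r in range(max_removed + 1)}
    return max(scores, key=scores.get), scores
```

Because the candidates are ordered by weight, only `max_removed + 1` models
need to be built, instead of one model per subset of features.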

4.4 Comparison with Other Classifiers 

Table 3 shows the accuracies of our DCC models on the four data sets together
with the results of five other classification algorithms: Quinlan's C5.0 and
its boosted version, Discrim (a statistical classifier developed by
R. Henery), Naive Bayes (a part of the IND package from NASA's COSMIC center)
and KNN (developed by C. Taylor). On average, the DCC models outperformed the
other classifiers. In our DCC models, the accuracy on the training data set
is very close to the accuracy on the test data set. This indicates that our
DCC models did not have significant over-fitting problems in comparison with
the other models.

The results also suggest that the DCC models are stable and robust in dealing
with different data types, e.g., numerical, categorical or mixed. Some of the
results were generated from reduced spaces of the original data sets. This is
an advantage of our approach because we can build equivalent models from a
subspace of fewer dimensions, so the stability and robustness of the DCC
models can be increased and the complexity of computation and storage
reduced.

Table 3. Comparisons of DCC models with other classifiers (in terms of %)

              HEART           CREDIT CARD     DIABETES        SATELLITE IMAGE
              Train   Test    Train   Test    Train   Test    Train   Test
DCC           78.84   85.19   83.32   87.32   75.79   75.22   86.20   84.70
C5.0          96.30   87.85   90.00   84.98   80.26   76.09   98.84   85.90
Boosted C5.0  98.94   87.65   99.09   87.32   96.65   73.91   99.95   90.45
Discrim       68.50   60.70   85.10   85.90   78.00   77.50   85.10   82.90
Bayes         64.90   62.60   86.40   84.90   76.10   73.80   69.20   72.30
KNN           100.00  52.20   100.00  81.90   100.00  67.60   91.10   96.60


5 Conclusions 

In this paper we have presented an interactive clustering approach to
building classification models. We have presented methods that use the
W-k-means algorithm to calculate the feature weights that are used in making
classification decisions on new objects. We have shown how to use the Fastmap
technique to determine the number of clusters into which a cluster node is
partitioned and how to use the Fastmap view to identify outliers. We have
used experimental results on real data sets to demonstrate the advantages of
our approach in producing highly accurate, stable and robust classification
models.

To the best of our knowledge, this was the first attempt to use k-means type
clustering algorithms, a feature weighting approach, the Fastmap technique
and an outlier handling technique collectively in solving classification
problems. Our experimental results have demonstrated that this approach
worked well in comparison with other methods. Our experiments were conducted
interactively in this work. We plan to investigate an automatic approach to
building the DCC models by optimizing the construction of the cluster tree
and the selection of decision clusters.

Abstract. This paper introduces the notion of a fuzzy context model as a
formal framework for the representation and manipulation of vague knowledge.
The motivation for the fuzzy context model arises from the consideration of
several practical situations in data analysis, interpretation of vague
concepts, and modeling expert knowledge for decision-making support. It is
shown that the fuzzy context model can provide a constructive approach to
fuzzy sets of type 2, emerging from the viewpoint of modeling vague
conceptual knowledge, as well as to an uncertainty measure of type 2, which
is induced from vague knowledge expressed linguistically.

Keywords: Fuzzy context model, uncertainty measure of type 2,
decision-making, context-dependent fuzzy set



1 Introduction 

Vagueness and uncertainty are fundamental and unavoidable features in many
research fields. As is well known, the two most widely used approaches to
dealing with uncertainty and vagueness are probability theory and fuzzy set
theory. In recent years, motivated by varying concerns, researchers have
introduced numerous other approaches to dealing with uncertainty and
vagueness, including rough set theory [22], Dempster-Shafer theory of
evidence [1, 23], the transferable belief model [24], and the context model
[7], among many others.

In particular, in [7] Gebhardt and Kruse introduced the notion of the context
model as an integrating model of vagueness and uncertainty. The motivation
for the context model arises from the intention to develop a common formal
framework that supports a better understanding and comparison of existing
models of partial ignorance, so as to reduce the rivalry between well-known
approaches. Essentially, the authors presented basic ideas keyed to the
interpretation of Bayes theory and the Dempster-Shafer theory within the
context model. Furthermore, a direct comparison between these two approaches,
based on well-known decision-making problems within the context model, was
also examined in their paper. More recently, in [11, 12, 14] we have shown
that the notion of the context model can be used as a unified framework for
modeling fuzziness in vague concept analysis as well as uncertainty in
decision analysis situations. Interestingly, from a concept analysis point of
view, the context model can be semantically considered as a data model for
constructing membership functions of fuzzy concepts in connection with
likelihood as well as random set views on the interpretation of membership
grades.

In this paper, we extend the notion of the context model to the so-called
fuzzy context model for dealing with situations where both vagueness and
conflict coexist. To proceed, however, it is first necessary to briefly
clarify the motivation for such an extension. This is undertaken in Section
2. In Section 3, the notion of the context model and its relation to
Dempster-Shafer theory are briefly presented. Section 4 introduces the notion
of a fuzzy context model, and then describes how the fuzzy context model can
provide a formal framework for the representation and manipulation of vague
knowledge in modeling context-dependent vague concepts and in integrating
expert knowledge in decision-making support. Finally, Section 5 presents some
concluding remarks.

2 Motivation 

To clarify our motivation in this paper, let us observe the following situations. 

2.1 Motivation in Data Analysis 

As observed by Gebhardt and Kruse [7], in a large number of applications in
the field of knowledge-based systems, data characterizes the state of an
object (obj) with respect to underlying relevant frame conditions (cond). In
this sense, we assume that it is possible to characterize obj by an element
state(obj, cond) of a well-defined set dom(obj) of distinguishable object
states, usually called the universe of discourse or frame of discernment of
obj with respect to cond. We are then interested in the problem that the
original characterization of state(obj, cond) is not available due to a lack
of information about obj and cond. Generally, cond merely permits us to use
statements like “state(obj, cond) ∈ char(obj, cond)”, where
char(obj, cond) ⊆ dom(obj) is called an imprecise characterization of obj
with respect to cond. The second kind of imperfect knowledge in the context
model is conflict. This kind of imperfection is induced by information about
preferences between elements of char(obj, cond) that accounts for the
existence of contexts. The combined occurrence of imprecision and conflict in
data reflects vagueness in the context model, and state(obj, cond) is
described by a so-called vague characteristic of obj with respect to cond.

Although information about preferences between elements of char(obj, cond) is
modelled by contexts, this also means that, within each context, they have
the same possibility or chance of being the unknown original value of
state(obj, cond). However, in many practical situations, even in the same
context, elements of char(obj, cond) may have different degrees of
possibility of being the unknown original value of state(obj, cond). This is
especially so in situations where cond only permits us to express knowledge
in the form of verbal statements like “state(obj, cond) is A”, where A is a
linguistic value represented by a fuzzy set in dom(obj).

2.2 Motivation in Modeling Expert Knowledge

Let us consider a predictive problem with a predictive variable p associated
with the domain D. Assume that from available statistical data, making use of
traditional techniques of prediction modeling, one may obtain several
possible predictions for p, represented by subsets of D, say A1, ..., An.
Note that predictions may be given in the form of rules as, for example, in
rough set prediction where p is the decision attribute. Then Ai, for
i = 1, ..., n, are the subsets of D for which the respective rules are
satisfied. Conventionally, randomization techniques can be used to test the
significance of predictions. Although randomization methods are quite useful,
they are rather expensive in resources and are only applicable as a
conditional testing scheme [5]. That is, though they tell us when a rule may
be due to chance, they do not provide us with a metric for the comparison of
different rules. When no such randomization method is available, it would be
useful if one could utilize domain expert knowledge, expressed in the form of
verbal statements, in a testing scheme. Let us consider a simple example as
follows.

Example 1. Assume that we want to forecast the temperature of the next day.
Let D = {−40, ..., 40} be the domain of the variable temperature (measured in
°C). We are told by expert E1 that tomorrow’s temperature will be very high,
whereas another expert E2 asserts that it will be medium. Assuming that we
have a degree of confidence of 0.6 in expert E1 and of 0.4 in expert E2, what
is our preference about some predicted intervals of tomorrow’s temperature?

This example is a variant of an example found in Denoeux [2]. However, whilst 
Denoeux proposed a principled approach to the representation and manipulation 
of imprecise degrees of belief within the framework of Dempster-Shafer theory 
(DS theory, for short), in the following we will model the problem by using the 
notion of fuzzy context model. 

2.3 Motivation in Modeling Context-Dependent Vague Concepts 

The notion of fuzzy sets was first introduced by Zadeh [28] as a mathematical
model of vague concepts in natural language, making use of the notion of
partial degrees of membership in connection with the automatic representation
and manipulation of human knowledge.

Practically, the context-sensitive nature of fuzzy (or vague) concepts in
natural language such as “tall”, “large”, etc., is well known. For example,
the concept “tall”, which qualitatively describes the height of people, might
very well mean something quite different from a European point of view than
from an Asian one. That is, for each fuzzy concept F imposed on objects in a
universe U, the meaning of F may change with “context”. To formalize this, we
may assume there is a finite, non-empty set C of contexts in which the fuzzy
concept is being conceived or realized. Note that by “context” we mean a
generic term that may stand for situation, context, agent, person, etc.
Furthermore, in each context c ∈ C, the fuzzy concept F may be conceived
in/by c in a fashion based on either bivalent logic or multivalent logic. For
example, let us consider the fuzzy concept “tall” and a particular person,
say John, with a height of h. Moreover, we have a population of different
individuals (voters) considered as contexts. Now, given John’s height of h,
each voter is asked to give an answer to the question “Is John tall?”. In
general, there may be some voters who are hesitant to say “Yes” or “No” to
the question, and then the response should be a matter of degree. However, we
would also argue that this does not invalidate the consideration of bivalent
logic if we do not allow voters to refuse to respond with a “Yes/No” answer.

3 Basic Concepts of the Context Model 

Formally, a context model is defined as a triple

$$\mathcal{M} = \langle D, C, \Gamma_C(D) \rangle$$

where $D$ is a nonempty universe of discourse, $C$ is a nonempty finite set
of contexts, and $\Gamma_C(D) = \{a \mid a : C \to 2^D\}$ is called the set of
all vague characteristics of $D$ with respect to $C$. Let
$a \in \Gamma_C(D)$; $a$ is said to be contradictory (resp., consistent) if
and only if there exists $c \in C$ such that $a(c) = \emptyset$ (resp.,
$\bigcap_{c \in C} a(c) \neq \emptyset$). For $a_1, a_2 \in \Gamma_C(D)$,
$a_1$ is said to be more specific than $a_2$ if and only if
$(\forall c \in C)(a_1(c) \subseteq a_2(c))$. In this paper we confine
ourselves to only vague characteristics that are not contradictory in the
context model.

If there is a finite measure $P_C : 2^C \to \mathbb{R}^+$ that fulfills
$(\forall c \in C)(P_C(\{c\}) > 0)$, then $a \in \Gamma_C(D)$ is called a
valuated vague characteristic of $D$ with respect to $P_C$. Then we call a
quadruple $\mathcal{M}_1 = \langle D, C, \Gamma_C(D), P_C \rangle$ a valuated
context model. Mathematically, if $P_C(C) = 1$ the mapping $a : C \to 2^D$ is
a random set, but obviously with a different interpretation within the
context model.

Let $a$ be a vague characteristic in
$\mathcal{M}_1 = \langle D, C, \Gamma_C(D), P_C \rangle$. For each
$X \in 2^D$, we define the acceptance degree $Acc_a(X)$ that evaluates the
degree to which the proposition “state(obj, cond) ∈ X” is true. Due to the
inherent imprecision of $a$, it does not allow us to uniquely determine the
acceptance degrees $Acc_a(X)$, $X \in 2^D$. However, as shown in [7], we can
calculate lower and upper bounds for them as follows:

$$\underline{Acc}_a(X) = P_C(\{c \in C \mid \emptyset \neq a(c) \subseteq X\}) \qquad (1)$$

$$\overline{Acc}_a(X) = P_C(\{c \in C \mid a(c) \cap X \neq \emptyset\}) \qquad (2)$$
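
As an illustration of (1) and (2), the following sketch computes both bounds
for a finite valuated context model. The representation is an assumption of
this example: a vague characteristic is stored as a dictionary from contexts
to Python sets, and the measure as a dictionary from contexts to positive
numbers. The function name is hypothetical.

```python
def acceptance_bounds(a, P, X):
    """Lower and upper acceptance degrees of X, following (1) and (2).

    a -- dict mapping each context c to the set a(c), a subset of D
    P -- dict mapping each context c to its measure P_C({c}) > 0
    X -- the set of states whose acceptance is evaluated
    """
    X = set(X)
    # (1): contexts whose image is non-empty and entirely contained in X.
    lower = sum(P[c] for c, image in a.items() if image and image <= X)
    # (2): contexts whose image shares at least one element with X.
    upper = sum(P[c] for c, image in a.items() if image & X)
    return lower, upper
```

With three contexts of measure 0.5, 0.25 and 0.25 mapped to {1, 2}, {2, 3}
and {4}, the set X = {2} has lower bound 0, since no image lies inside it,
and upper bound 0.75, since the first two images intersect it.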

Clearly, both DS theory and the context model are closely related to the
theory of multivalued mappings. In fact, each vague characteristic in the
context model is formally a multivalued mapping from the set of contexts into
the universe of discourse. Now, for the sake of discussing essential remarks
regarding the interpretation of the DS theory within the context model, we
assume that $P_C$ is a probability measure on $C$. Let $a$ be a vague
characteristic in $\mathcal{M}_1$, now considered as a multivalued mapping
from $C$ into $D$. The domain of $a$, denoted by $\mathrm{Dom}(a)$, is
defined by

$$\mathrm{Dom}(a) = \{c \in C \mid a(c) \neq \emptyset\}$$

Then $a$ induces lower and upper probabilities, in the sense of Dempster, on
$2^D$ as follows. For any $X \in 2^D$,

$$P(a)_*(X) = \frac{P_C(a^-(X))}{P_C(a^-(D))}, \qquad P(a)^*(X) = \frac{P_C(a^+(X))}{P_C(a^+(D))}$$

where

$$a^-(X) = \{c \in C \mid c \in \mathrm{Dom}(a) \wedge a(c) \subseteq X\}$$

$$a^+(X) = \{c \in C \mid a(c) \cap X \neq \emptyset\}$$

Clearly, $a^+(D) = a^-(D) = \mathrm{Dom}(a)$, and $P(a)_*$, $P(a)^*$ are well
defined only when $P_C(\mathrm{Dom}(a)) \neq 0$.

In the case where $a$ is non-contradictory, we have $\mathrm{Dom}(a) = C$.
Then, these probabilities coincide with the lower and upper acceptance
degrees as defined in (1) and (2) respectively. That is, for any $X \in 2^D$,

$$P(a)_*(X) = \underline{Acc}_a(X), \quad \text{and} \quad P(a)^*(X) = \overline{Acc}_a(X)$$



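Furthermore, Gebhardt and Kruse also defined the so-called mass distribution
$m_a$ of $a$ as follows

$$m_a(X) = P_C(a^{-1}(X)), \quad \text{for any } X \in 2^D$$

Then, for any $X \in 2^D$, we have

$$\underline{Acc}_a(X) = \sum_{A \in a(C):\, \emptyset \neq A \subseteq X} m_a(A)$$

$$\overline{Acc}_a(X) = \sum_{A \in a(C):\, A \cap X \neq \emptyset} m_a(A)$$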

As such, the mass distribution $m_a$ induced from $a$ in the context model C can
be considered as the counterpart of a basic probability assignment in DS
theory. However, the motivations of the two approaches are somewhat different.
More details on the context model and its applications can be found in [7, 8, 9, 
19]. 
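The bounds (1) and (2) and the mass distribution can be illustrated with a minimal Python sketch. It assumes a finite universe and a non-contradictory vague characteristic; the contexts, weights and function names (`acceptance_bounds`, `mass_distribution`) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def acceptance_bounds(a, p_c, x):
    """Lower and upper acceptance degrees of X, following equations (1) and (2)."""
    x = frozenset(x)
    # Lower bound: contexts whose (non-empty) image lies entirely inside X.
    lower = sum(p for c, p in p_c.items() if a[c] and a[c] <= x)
    # Upper bound: contexts whose image meets X.
    upper = sum(p for c, p in p_c.items() if a[c] & x)
    return lower, upper

def mass_distribution(a, p_c):
    """m_a(A) = P_C({c : a(c) = A}) for every set A in the image a(C)."""
    m = {}
    for c, p in p_c.items():
        m[a[c]] = m.get(a[c], 0) + p
    return m

# Hypothetical data: three contexts over the universe D = {1, 2, 3, 4}.
p_c = {"c1": Fraction(1, 2), "c2": Fraction(3, 10), "c3": Fraction(1, 5)}
a = {"c1": frozenset({1, 2}), "c2": frozenset({2, 3}), "c3": frozenset({1, 2})}

lower, upper = acceptance_bounds(a, p_c, {1, 2})
m = mass_distribution(a, p_c)
print(lower, upper, m)
```

Summing `m` over the focal sets contained in X reproduces the lower bound, which is the correspondence with a basic probability assignment noted above.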

4 The Fuzzy Context Model 

4.1 Definition 

As observed above, although the context model can be considered as an
autonomous approach to the handling of imperfect knowledge, in its standard
form it does not allow us to directly model situations where cond only permits
us to express state(obj, cond) in each context in the form of verbal statements.
Moreover, we may also agree that vague concepts are used as verbal descriptions
of characteristics of objects, with a tolerance for imprecision, in human
reasoning. In practice, people often use statements like "att(obj) is A", where A is
a linguistic term that qualitatively describes an attribute of the object denoted
by att(obj). In addition, the specific meaning of vague concepts in human
thinking and communication is always determined by contexts, by personal views,
etc., i.e., their interpretation (and/or meaning) depends on the context in which
they are uttered.

These observations motivated us, with some abuse of notation, to introduce
an extension of the context model, the so-called fuzzy context model, as a quadruple

$\mathcal{FM} = \langle D, C, \Gamma_C(D), P_C \rangle$

where $D$ is a nonempty universe of discourse, $C$ is a nonempty finite set of
contexts, and $\Gamma_C(D)$ is a subset of the set $\{a \mid a : C \to \tilde{\mathcal{F}}(D)\}$, which is called
the set of all context-dependent vague characteristics of $D$ with respect to $C$;
here $\tilde{\mathcal{F}}(D)$ denotes the set of all normal fuzzy subsets of $D$. By definition, we
also restrict ourselves to context-dependent vague characteristics that are
not contradictory in the fuzzy context model.

4.2 Context-Dependent Vague Concepts Interpreted 
by the Fuzzy Context Model 

The notion of type 2 fuzzy sets was introduced by Zadeh in [29]. Let $U$ be
a non-empty set called the universe, and $\mathcal{F}(U)$ be the set of all fuzzy sets on $U$,
i.e., $\mathcal{F}(U) = \{\mu \mid \mu : U \to [0, 1]\}$. Now, let $V$ be a further non-empty set. Then
a type 2 fuzzy set on $U$ with respect to $V$ is defined as a mapping from $U$ into $\mathcal{F}(V)$. Since
their introduction by Zadeh, fuzzy sets of type 2 have been investigated, especially under
the assumption that $V = [0, 1]$ or $V$ is eventually finite, from both theoretical and
practical points of view in, e.g., [21, 15, 16, 17], among others.

To interpret context-dependent vague concepts in terms of the fuzzy context
model, let us consider a linguistic variable $L$ associated with a set of linguistic
values (or term-set) $T$ and the domain $D$ of the base variable. Assume further that
there is a finite, non-empty set $C$ of contexts in which vague concepts in $T$ are
conceived or realized. We may think of a probability measure $P_C$ on $C$ as
weights assigned to contexts, or as the probabilities of randomly selecting the contexts
in which a vague concept is conceived. Then the linguistic variable $L$ can be
represented as a fuzzy context model

$L = \langle D, C, T, P_C \rangle$

where each linguistic term in $T$ is represented as a context-dependent vague
characteristic. For $t \in T$, $t(c)$ is called the membership function of $t$ in the
context $c$. Then the global membership degree of $x \in D$ to the vague concept $t$
may be defined as the mapping

$c \mapsto t(c)(x)$, $c \in C$,

which is a random variable on [0, 1]. Interestingly, this view of vague concepts
is close to that considered in [17], where it is regarded as the result of
translating the uncertainty in data into uncertainty in the membership function.
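As a small illustration of the global membership degree as a random variable, the following Python sketch tabulates its distribution for one point of the domain. The term "warm", the triangular membership functions and the context weights are all hypothetical assumptions, not data from the paper.

```python
def triangular(left, peak, right):
    """Return a triangular membership function with the given support and peak."""
    def mu(x):
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)
    return mu

def global_membership(t, p_c, x):
    """Distribution of the random variable c -> t(c)(x): degree -> probability."""
    dist = {}
    for c, p in p_c.items():
        degree = t[c](x)
        dist[degree] = dist.get(degree, 0) + p
    return dist

# Hypothetical term "warm" as seen in three contexts, with context weights P_C.
warm = {
    "c1": triangular(10, 20, 30),
    "c2": triangular(15, 25, 35),
    "c3": triangular(5, 15, 25),
}
p_c = {"c1": 0.5, "c2": 0.3, "c3": 0.2}

dist = global_membership(warm, p_c, 15)
mean_degree = sum(degree * p for degree, p in dist.items())
print(dist, mean_degree)
```

The expected value computed at the end is only one possible summary of this random variable; the text does not prescribe any particular one.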

Very recently, in [26] the author introduced the concept of (Kripke semantics
based) context-dependent fuzzy sets as a new approach to type 2 fuzzy
sets. Let $W$ be a non-empty set of so-called possible worlds. Then a mapping
$\Psi : U \times W \to [0, 1]$ is called a context-dependent fuzzy set on $U$ with respect
to $W$. The author also investigated some useful applications of context-dependent
fuzzy sets in, e.g., interpreting vague concepts, fuzzy approximate
reasoning, and modal fuzzy approximate reasoning [27].

It is of interest to note that each context-dependent vague characteristic $a$
in the fuzzy context model is formally equivalent to a type 2 fuzzy set
on $C$ with respect to $D$ in the sense of Zadeh [29]. Furthermore, there is a very
close interrelation between context-dependent vague characteristics within the
fuzzy context model and the notion of context-dependent fuzzy sets introduced
in [26, 27]. Indeed, given a context-dependent vague characteristic $a \in \Gamma_C(D)$,
we obtain a context-dependent fuzzy set $f$ defined as follows:

$f : D \times C \to [0, 1]$

$(d, c) \mapsto f(d, c) =_{def} a(c)(d)$

On the other hand, if we have a context-dependent fuzzy set $\Psi$ on $D$ with
respect to $C$, then for fixed $c \in C$ we define a fuzzy set $\Psi_c$ on $D$ by $\Psi_c(d) = \Psi(d, c)$,
for $d \in D$, and put $a(c) = \Psi_c$. Obviously, $a$ is a context-dependent vague
characteristic. However, in this case $a$ may be contradictory in general. As such,
the notion of fuzzy context model can also provide a constructive approach to
fuzzy sets of type 2. The following example is a variant of that taken from [27].

Example 2. Let us consider the linguistic value high of the linguistic variable
Amount-of-Money. Assume that we have fixed the set $C = \{c_1, c_2, c_3\}$ of contexts,
where $c_1, c_2, c_3$ are the contexts in which the linguistic value high is recognized
by an unemployed person, a university professor, and an oil sheikh, respectively. Then
one could fix that for contexts $c_1$, $c_2$, and $c_3$ the vague concept "high amount
of money" is interpreted by a fuzzy set on $\mathbb{R}^+$ describing about 1 thousand,
about 1 million, and about 1 billion dollars, respectively. As such, the linguistic
value high of the linguistic variable Amount-of-Money is directly interpreted as
a context-dependent vague characteristic within a fuzzy context model without
a finite measure $P_C$ on the set of contexts, namely $\mathcal{FM} = \langle \mathbb{R}^+, C, T \rangle$, where $T$ is
interpreted as corresponding to the set of linguistic values of Amount-of-Money.
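The passage between a context-dependent vague characteristic and a context-dependent fuzzy set can be sketched in Python for the setting of Example 2. The shape chosen for "about x" (a triangle of half-width x/2) and all function names are assumptions made only for this illustration.

```python
def about(center):
    """Hypothetical fuzzy set 'about center': triangular, support (center/2, 3*center/2)."""
    def mu(d):
        return max(0.0, 1.0 - abs(d - center) / (center / 2))
    return mu

# Context-dependent vague characteristic 'high': one normal fuzzy set per context.
high = {"c1": about(1e3), "c2": about(1e6), "c3": about(1e9)}

def to_context_dependent_fuzzy_set(a):
    """Build f with f(d, c) = a(c)(d)."""
    return lambda d, c: a[c](d)

def to_vague_characteristic(f, contexts):
    """Build a with a(c) = f(., c) for each context c."""
    return {c: (lambda d, c=c: f(d, c)) for c in contexts}

f = to_context_dependent_fuzzy_set(high)
back = to_vague_characteristic(f, high.keys())

print(f(1e6, "c2"), f(1e6, "c1"), back["c1"](750.0))
```

The same amount, one million dollars, fully belongs to "high" in context c2 and not at all in contexts c1 and c3, which is the context dependence the example describes.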

In practical applications we need to operate not only with context-dependent
vague characteristics within the same fuzzy context model but also with those
from different fuzzy context models. Important set-theoretic operations on
context-dependent vague characteristics, as well as methods of approximate
reasoning based on the fuzzy context model, should be the subject of further
research.

4.3 Fuzzy Context Model for Modeling Expert Knowledge

In this section we will see that the notion of fuzzy context model not only gives
an alternative interpretation for context-dependent vague concepts in connection
with linguistic variables, but also provides a framework for modeling expert
knowledge, resulting in an uncertainty measure of type 2.

As observed in Section 2, let us return to a predictive variable $p$ associated
with a non-empty domain $D$. Assume that from available statistical data, by
using traditional techniques of prediction modeling, we may obtain several
possible predictions for $p$, represented by subsets of $D$, as $A_1, \ldots, A_n$. At the same
time, we may also ask domain experts to give their predictions or evaluations, often
expressed linguistically. Then we can model the frame of expert knowledge with
respect to $p$ as a fuzzy context model

$\mathcal{K}_p = \langle D, C, \Gamma, P_C \rangle$

where $C$ is a finite set of domain experts, $P_C$ is a probability distribution on $C$,
and $\Gamma$ is a mapping from $C$ into $\mathcal{F}(D)$, the set of fuzzy subsets of $D$. One may
think of $P_C(E)$, $E \in C$, as a weight assigned to the domain expert $E$, or as the
probability of randomly choosing expert $E$'s knowledge as a source of testing.
Here, for the sake of simplicity, we assume that the experts' knowledge is not
contradictory, i.e., for each $E \in C$, $\Gamma(E)$ is a normal fuzzy set in $D$.

Formally, the frame of expert knowledge is nothing but an extension of the
so-called evidential structure in DS theory. It is also formally equivalent to the
so-called fuzzy belief structure [2]; however, the motivation and formulation
here are somewhat different.

From the point of view of prediction analysis, based on the knowledge of domain
experts, we intend to evaluate the preference degree Pre($X$), for $X \in 2^D$, that
the proposition "$p \in X$" will be true in the future.

Obviously, due to the inherent vagueness and partial conflict of the experts'
knowledge in $\mathcal{K}_p$, we cannot uniquely determine the preference degrees Pre($X$),
for $X \in 2^D$. Moreover, we cannot even calculate crisp lower and upper bounds for them
as considered in, e.g., [1, 7, 13]; what we obtain instead is a fuzzy quantity in the unit
interval [0, 1]. This can be done in terms of the $\alpha$-cuts of the fuzzy sets $\Gamma(E)$, $E \in C$, as follows.

Let $\alpha$ be any real number in (0, 1], and ${}^{\alpha}\Gamma(E)$, for any $E \in C$, the $\alpha$-cut of
$\Gamma(E)$. Then we define

${}^{\alpha}\underline{Pre}(X) = P_C(\{E \in C \mid \emptyset \neq {}^{\alpha}\Gamma(E) \subseteq X\})$    (3)

${}^{\alpha}\overline{Pre}(X) = P_C(\{E \in C \mid {}^{\alpha}\Gamma(E) \cap X \neq \emptyset\})$    (4)

For any $\alpha, \beta \in (0, 1]$ with $\alpha \le \beta$ we have ${}^{\beta}\Gamma(E) \subseteq {}^{\alpha}\Gamma(E)$. It follows directly
from (3) and (4) that

${}^{\alpha}\underline{Pre}(X) \le {}^{\beta}\underline{Pre}(X)$ and ${}^{\beta}\overline{Pre}(X) \le {}^{\alpha}\overline{Pre}(X)$

Equivalently, we have

$[{}^{\beta}\underline{Pre}(X), {}^{\beta}\overline{Pre}(X)] \subseteq [{}^{\alpha}\underline{Pre}(X), {}^{\alpha}\overline{Pre}(X)]$

Under this monotonicity condition, we can now define Pre($X$) as a fuzzy
set on [0, 1] whose membership function $\mu_{Pre(X)}$ is defined by

$\mu_{Pre(X)}(r) = \sup_{\alpha} \{\alpha \mid r \in [{}^{\alpha}\underline{Pre}(X), {}^{\alpha}\overline{Pre}(X)]\}$

As such, Pre($X$) can be considered as the degree of preference, directly
inferred from "vague" knowledge expressed linguistically, that the proposition
"$p \in X$" will be true in the future.
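The construction of the level-wise bounds (3) and (4) and of the resulting membership function can be sketched in Python for a finite domain. The experts, their discrete fuzzy sets and the helper names are hypothetical; the sketch also assumes normal fuzzy sets with finitely many membership values, so that the supremum is attained at one of those values.

```python
from fractions import Fraction as F

def alpha_cut(fuzzy_set, alpha):
    """Elements whose membership degree is at least alpha."""
    return frozenset(d for d, mu in fuzzy_set.items() if mu >= alpha)

def pre_bounds(gamma, p_c, x, alpha):
    """Lower and upper preference degrees at level alpha, as in (3) and (4)."""
    x = frozenset(x)
    cuts = {e: alpha_cut(fs, alpha) for e, fs in gamma.items()}
    lower = sum(p_c[e] for e, cut in cuts.items() if cut and cut <= x)
    upper = sum(p_c[e] for e, cut in cuts.items() if cut & x)
    return lower, upper

def pre_membership(gamma, p_c, x, r):
    """Largest membership level whose interval contains r (0 if there is none)."""
    levels = sorted({mu for fs in gamma.values() for mu in fs.values() if mu > 0})
    best = 0
    for alpha in levels:
        lower, upper = pre_bounds(gamma, p_c, x, alpha)
        if lower <= r <= upper:
            best = alpha
    return best

# Hypothetical frame of expert knowledge over D = {1, 2, 3, 4}.
p_c = {"E1": F(3, 5), "E2": F(2, 5)}
gamma = {
    "E1": {1: F(0), 2: F(1, 2), 3: F(1), 4: F(1, 2)},
    "E2": {1: F(1, 2), 2: F(1), 3: F(1, 2), 4: F(0)},
}
x = {3, 4}

print(pre_bounds(gamma, p_c, x, F(1, 2)), pre_bounds(gamma, p_c, x, F(1)))
print(pre_membership(gamma, p_c, x, F(3, 5)), pre_membership(gamma, p_c, x, F(1, 5)))
```

In this toy case the interval shrinks from [0, 1] at level 1/2 to the single point 3/5 at level 1, which illustrates the nesting of the intervals stated above.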

Under such a procedure, we obtain a sequence of fuzzy quantities
Pre($A_i$), for $i = 1, \ldots, n$, as our degrees of preference, quantified on the basis of
expert knowledge, for the possible predictions $A_i$. The next step in the decision
process may then consist of comparing the obtained fuzzy quantities. This may
be done, for example, on the basis of a partial order such as Pre($A_i$) $\le$ Pre($A_j$)
if and only if

${}^{\alpha}\underline{Pre}(A_i) \le {}^{\alpha}\underline{Pre}(A_j)$

and

${}^{\alpha}\overline{Pre}(A_i) \le {}^{\alpha}\overline{Pre}(A_j)$

for any $\alpha \in (0, 1]$. In this case we have to admit indeterminacy when two fuzzy
degrees of preference are incomparable.
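A level-wise comparison of this kind can be sketched as follows. The sketch assumes that each fuzzy quantity is given by its lower and upper bounds on the same finite set of levels; the representation, the function name `compare` and the numbers are hypothetical.

```python
def compare(pre_i, pre_j):
    """Compare two fuzzy quantities given as {alpha: (lower, upper)}.

    Returns '==', '<=', '>=' or None when the two are incomparable.
    """
    levels = sorted(pre_i)
    le = all(pre_i[a][0] <= pre_j[a][0] and pre_i[a][1] <= pre_j[a][1] for a in levels)
    ge = all(pre_i[a][0] >= pre_j[a][0] and pre_i[a][1] >= pre_j[a][1] for a in levels)
    if le and ge:
        return "=="
    if le:
        return "<="
    if ge:
        return ">="
    return None

# Hypothetical preference degrees for three predictions on the levels 0.5 and 1.0.
pre_1 = {0.5: (0.0, 1.0), 1.0: (0.6, 0.6)}
pre_2 = {0.5: (0.0, 0.4), 1.0: (0.0, 0.4)}
pre_3 = {0.5: (0.4, 1.0), 1.0: (0.4, 0.4)}

print(compare(pre_2, pre_1), compare(pre_1, pre_3))
```

The second comparison returns `None`: neither quantity dominates the other on every level, which is the indeterminacy that a partial order forces us to accept.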

Remark. The manipulation of fuzzy quantities may be considerably simplified
by restricting consideration to fuzzy numbers with the L-R parameterization
introduced by Dubois and Prade [3]. Then, as mentioned in Klir and Yuan [18],
many methods for the total ordering of fuzzy numbers that have been suggested in
the literature can be used in the comparison of fuzzy degrees of preference. It
should be noted that, in the spirit of previous applications of fuzzy set theory
to decision analysis, e.g. [4, 6, 25], the utilities were often described in terms of
fuzzy numbers.

Example 3. This example models Example 1. Assume that the linguistic values very
high and medium are represented by normal fuzzy sets in $D$ whose membership
functions are denoted by $\mu_{VH}$ and $\mu_M$, respectively. Then we have

$\mathcal{K}_{temp} = \langle D, C, \Gamma, P \rangle$

where $D = \{-40, \ldots, 40\}$, $C = \{E_1, E_2\}$, $P(E_1) = 0.6$, $P(E_2) = 0.4$, and
$\Gamma(E_1) = \mu_{VH}$, $\Gamma(E_2) = \mu_M$.

Assume that we have to decide on a forecast interval for tomorrow's temperature
from some available predicted temperature intervals, say $T_1, T_2, T_3$.
By the procedure specified above, we can calculate Pre($T_i$) for $i = 1, 2, 3$.
Then, for instance, the final selection of a prediction can be made on the basis
of the comparison and ranking of the fuzzy quantities Pre($T_i$), as proposed in [20].

5 Conclusions

In this paper, we have introduced the notion of fuzzy context model to deal with
situations of data analysis where imprecision and uncertainty co-exist. From
the viewpoint of modeling conceptual knowledge concerning the use of linguistic
variables, the fuzzy context model gives an alternative interpretation of vague
concepts, which may have different meanings in practice depending on the
contexts in which they are uttered. From the viewpoint of decision analysis, the fuzzy
context model also provides an approach to the problem of synthesizing vague
knowledge linguistically provided by experts in some practical situations,
resulting in an uncertainty measure of type 2. It should be emphasized that the
notion of fuzzy context model may allow us to model some situations where
heterogeneous data coming from a variety of sources considered as contexts
(especially human-centered systems encapsulating human expertise) have
to be taken into account [10].

Abstract. A C++ program has been developed that generates random fuzzy
relations of a given dimension and computes their T-transitive closure (which
contains the initial relation) and the new T-transitivized relation (which is
contained in the initial relation) for the minimum, product and
Lukasiewicz t-norms. Several distances between the initial relation and both its
transitive closure and its transitivized relation have been computed one hundred
times for each dimension and each t-norm. The results show that the average
distance of the random fuzzy relations to the transitive closure is higher than
the average distance to the new transitivized relation.



1 Introduction 

A new method to T-transitivize fuzzy relations [Garmendia & Salvador; 2000] can be
used to give a new measure of the T-transitivity of fuzzy relations. It can also be used to
build T-transitive fuzzy relations from a given fuzzy relation.

When the initial fuzzy relation is reflexive, the algorithm generates T-preorders
that are different from the T-preorders generated from the T-transitive closure.

The transitive closure of a fuzzy relation contains the initial relation, whereas the
transitivized relation is contained in the initial fuzzy relation.

The results of this paper are obtained from a C++ program that generates random fuzzy
relations of a given dimension and computes their Min-transitive closure,
Prod-transitive closure and W-transitive closure, and their Min-transitivized relation,
Prod-transitivized relation and W-transitivized relation.

The measure of T-transitivity of a fuzzy relation is computed as the
difference between the transitivized relation and the original one, using several
distances such as the absolute value of the difference, the Euclidean distance or normalised
distances. Those distances are also measured between the same random fuzzy
relations and their T-transitive closures, and turn out to be higher than the average
distances to the T-transitivized relation for all dimensions computed.

2 Preliminaries

2.1 The Importance of Transitivity

The T-transitive property is held by T-indistinguishabilities and T-preorders, and it is
important in fuzzy inference in order to obtain Tarski consequences. Similarities
and T-indistinguishabilities generalise the classical equivalence relations, and are
useful to classify or to make fuzzy partitions of a set.

Even though not all fuzzy inference in control needs transitivity, it is
important to know whether a fuzzy relation is T-transitive in order to make fuzzy
inferences, and if a relation is not T-transitive it is possible to find another
T-transitive fuzzy relation as close as possible to the initial fuzzy relation.

2.2 Transitive Closure 

The T-transitive closure $R^T$ of a fuzzy relation $R$ is the lowest relation that contains $R$
and is T-transitive.

An algorithm used to compute the transitive closure is the following:

1) $R' = R \cup_{Max} (R \circ_{Sup\text{-}T} R)$

2) If $R' \neq R$ then $R := R'$ and go back to 1); otherwise stop and $R^T := R'$.
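The two steps above can be sketched in Python (the paper's own program is written in C++ and is not reproduced here). Relations are represented as lists of rows, and the function names are hypothetical.

```python
def t_min(x, y):
    return min(x, y)

def t_prod(x, y):
    return x * y

def t_luk(x, y):
    """Lukasiewicz t-norm W."""
    return max(0.0, x + y - 1.0)

def sup_t_composition(r, s, t):
    """(R o S)(i, j) = max over k of T(R(i, k), S(k, j))."""
    n = len(r)
    return [[max(t(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(r, t):
    """Iterate R' = max(R, R o R) until nothing changes."""
    current = [row[:] for row in r]
    while True:
        comp = sup_t_composition(current, current, t)
        nxt = [[max(a, b) for a, b in zip(row_c, row_p)]
               for row_c, row_p in zip(current, comp)]
        if nxt == current:
            return nxt
        current = nxt

# Small hypothetical relation.
r = [[1.0, 0.8, 0.0],
     [0.0, 1.0, 0.5],
     [0.0, 0.0, 1.0]]

for name, t in (("Min", t_min), ("Prod", t_prod), ("W", t_luk)):
    print(name, transitive_closure(r, t))
```

Only the entry in the first row and third column changes in this example, and by a different amount for each t-norm, since each closure raises it just enough to dominate T(0.8, 0.5).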

2.3 A New T-transitivization Algorithm

In "On a new method to T-transitivize fuzzy relations" [Garmendia & Salvador; 2000],
a new algorithm to T-transitivize fuzzy relations is proposed, obtaining a
T-transitive fuzzy relation as close as possible to the initial fuzzy relation. If the initial
relation is T-transitive then it is equal to the T-transitivized relation.

The transitivized relation keeps important properties such as $\mu$-T-conditionality
and reflexivity, which the transitive closure also preserves, but it also keeps
some further properties such as the invariance of the relation degree of every element with
itself (the diagonal), and so it preserves $\alpha$-reflexivity. The transitive closure does not
preserve $\alpha$-reflexivity.

2.4 Previous Concepts 

Let $E = \{a_1, \ldots, a_n\}$ be a finite set.

Definition 1: Let $T$ be a triangular t-norm. A fuzzy relation $R: E \times E \to [0, 1]$ is
T-transitive if $T(R(a,b), R(b,c)) \le R(a,c)$ for all $a, b, c$ in $E$.
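Definition 1 translates directly into a check over all triples of indices. The Python sketch below is an illustration only; the small tolerance `tol` is an assumption added to absorb floating-point rounding, and the function names are hypothetical.

```python
T_NORMS = {
    "min": min,
    "prod": lambda x, y: x * y,
    "luk": lambda x, y: max(0.0, x + y - 1.0),  # Lukasiewicz t-norm W
}

def non_transitive_triples(r, t, tol=1e-12):
    """All (i, k, j) with T(R(i, k), R(k, j)) > R(i, j), up to the tolerance."""
    n = len(r)
    return [(i, k, j)
            for i in range(n) for k in range(n) for j in range(n)
            if t(r[i][k], r[k][j]) > r[i][j] + tol]

def is_t_transitive(r, t, tol=1e-12):
    return not non_transitive_triples(r, t, tol)

# Hypothetical relation: Prod- and W-transitive, but not Min-transitive.
r = [[1.0, 0.5, 0.25],
     [0.0, 1.0, 0.5],
     [0.0, 0.0, 1.0]]

for name, t in T_NORMS.items():
    print(name, is_t_transitive(r, t), non_transitive_triples(r, t))
```

Because W is below Prod, which is below Min, Min-transitivity is the most demanding of the three conditions, as the example shows.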

Given a fuzzy relation $R$, the relation degree in [0, 1] between the elements $a_i$ and $a_j$
in $E$ is called the element $a_{i,j}$. So $a_{i,j} = R(a_i, a_j)$.

Definition 2: An element $a_{i,j}$ is called a T-transitive element if $T(a_{i,k}, a_{k,j}) \le a_{i,j}$ for all $k$
from 1 to $n$.

Algorithm: The proposed algorithm transforms a fuzzy relation $R^0$ into another
T-transitive relation $R_T$ contained in $R^0$ in $n^2 - 1$ steps. In each step some
degrees can be reduced, so $R = R^0 \supseteq R^1 \supseteq \ldots \supseteq R^m \supseteq \ldots \supseteq R^{n^2-1} = R_T$.

The idea of this method is to take advantage of the fact that each step ensures that an
element will be T-transitive in all further steps, and so it will be T-transitive in the
final relation $R_T$. In summary, step $m+1$ T-transitivizes an element $a_{i,j}^m$ in $R^m$ by
reducing other elements $a_{i,k}^m$ or $a_{k,j}^m$ when necessary, so that $a_{i,j}^r$ is
T-transitive in $R^r$ for all $r > m$. To achieve this, it is important to choose in each step the
minimum not yet T-transitivized element as the candidate to transitivize (reducing other
elements). When the minimum $a_{i,j}^m$ in $R^m$ is chosen, it is guaranteed that
$a_{i,j}^m = a_{i,j}^r$ for all $r > m$ (it will not change in further steps): the reduction of other
elements cannot make it intransitive again, and, because $a_{i,j}^m$ is lower than or equal to
the elements transitivized later, it will not cause intransitivity and it will not be
reduced.

Let $X$ be the set of pairs $(i, j)$ where $i, j$ are integers from 1 to $n$.

Definition 3: $X^m$ is a subset of $X$ defined by:

1) $X^0 = \emptyset$

2) $X^{m+1} = X^m \cup \{(i, j)\}$ if $a_{i,j}^m$ is the element in $R^m$ chosen to be T-transitivized in
step $m+1$.

So $X^m$ is the set of pairs $(i, j)$ corresponding to the T-transitivized elements in $R^m$, and
$(X^m)'$ is the set of $n^2 - m$ pairs $(i, j)$ corresponding to the elements not yet transitivized.

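Building $R^{m+1}$ from $R^m$: Let $a_{i,j}^m$ be the element in $R^m$ that is going to be transitivized
at step $m+1$ ($a_{i,j}^m = \min\{a_{v,w}^m \text{ such that } (v, w) \in (X^m)'\}$).

$a_{p,s}^{m+1}$ is defined as

$$a_{p,s}^{m+1} = \begin{cases} J^T(a_{s,j}^m, a_{i,j}^m) & \text{if } p = i,\ T(a_{i,s}^m, a_{s,j}^m) > a_{i,j}^m \text{ and } a_{i,s}^m \le a_{s,j}^m \\ J^T(a_{i,p}^m, a_{i,j}^m) & \text{if } s = j,\ T(a_{i,p}^m, a_{p,j}^m) > a_{i,j}^m \text{ and } a_{i,p}^m > a_{p,j}^m \\ a_{p,s}^m & \text{otherwise} \end{cases} \qquad (1)$$

where $J^T$ is the residual operator of the t-norm $T$, defined by $J^T(x, y) = \sup\{z \mid T(x, z) \le y\}$.

If $T(a_{i,k}^m, a_{k,j}^m) > a_{i,j}^m$ for some $k$, either $a_{i,k}^m$ or $a_{k,j}^m$ will reduce its degree (the
minimum of both could be chosen) to achieve that $T(a_{i,k}^{m+1}, a_{k,j}^{m+1}) \le a_{i,j}^{m+1} = a_{i,j}^m$.

When choosing between $a_{i,k}^m$ and $a_{k,j}^m$ which one to reduce, if the minimum one is
chosen the difference between $R^m$ and $R^{m+1}$ is lower. So if $a_{i,k}^m \le a_{k,j}^m$ then
$a_{i,k}^{m+1} = J^T(a_{k,j}^m, a_{i,j}^m)$, and if $a_{i,k}^m > a_{k,j}^m$ then $a_{k,j}^{m+1} = J^T(a_{i,k}^m, a_{i,j}^m)$. The degree of the
rest of the elements remains invariant ($a_{r,s}^{m+1} = a_{r,s}^m$).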
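The step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' C++ program: ties between equal minimum elements are broken by index order, a small tolerance `tol` absorbs floating-point rounding, and the loop visits all $n^2$ elements (the last visit never reduces anything). All names are hypothetical.

```python
# Each entry holds a t-norm T and its residual J^T(x, y) = sup{z | T(x, z) <= y}.
T_NORMS = {
    "min": (lambda x, y: min(x, y), lambda x, y: 1.0 if x <= y else y),
    "prod": (lambda x, y: x * y, lambda x, y: 1.0 if x <= y else y / x),
    "luk": (lambda x, y: max(0.0, x + y - 1.0), lambda x, y: min(1.0, 1.0 - x + y)),
}

def transitivize(r, name, tol=1e-12):
    """Return a T-transitive relation contained in r, following the steps above."""
    t, residual = T_NORMS[name]
    n = len(r)
    a = [row[:] for row in r]
    pending = {(i, j) for i in range(n) for j in range(n)}
    while pending:
        # Choose the minimum element that has not been transitivized yet.
        i, j = min(pending, key=lambda p: (a[p[0]][p[1]], p))
        pending.remove((i, j))
        for k in range(n):
            if t(a[i][k], a[k][j]) > a[i][j] + tol:
                # Reduce the smaller of the two offending degrees.
                if a[i][k] <= a[k][j]:
                    a[i][k] = residual(a[k][j], a[i][j])
                else:
                    a[k][j] = residual(a[i][k], a[i][j])
    return a

def is_transitive(a, name, tol=1e-12):
    t, _ = T_NORMS[name]
    n = len(a)
    return all(t(a[i][k], a[k][j]) <= a[i][j] + tol
               for i in range(n) for k in range(n) for j in range(n))

# Hypothetical reflexive relation.
r = [[1.0, 0.75, 0.25],
     [0.5, 1.0, 0.75],
     [0.25, 0.5, 1.0]]

for name in T_NORMS:
    print(name, transitivize(r, name))
```

In each case the result lies below the input and leaves the diagonal untouched, in line with the properties claimed for the transitivized relation.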
3 The Program

3.1 Program Description

A C++ program has been developed that generates a random fuzzy relation
(shown at the top of the figure) and computes its Min-transitive closure,
Prod-transitive closure and W-transitive closure (first row of relations in the figure),
measuring the absolute value distance and the Euclidean distance to the initially
generated fuzzy relation. It also computes the Min-transitivized relation,
Prod-transitivized relation and W-transitivized relation (second row of relations in the
figure), and also measures their distances to the same original fuzzy relation.
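The two distances can be sketched in Python as follows. The paper does not spell out the formulas, so this sketch assumes that the absolute value distance is the sum of the absolute entry-wise differences and that the Euclidean distance is the square root of the sum of squared differences; the function names and data are hypothetical.

```python
import math

def absolute_distance(r, s):
    """Sum of |r_ij - s_ij| over all entries (assumed definition)."""
    return sum(abs(a - b) for row_r, row_s in zip(r, s) for a, b in zip(row_r, row_s))

def euclidean_distance(r, s):
    """Square root of the sum of squared entry-wise differences (assumed definition)."""
    return math.sqrt(sum((a - b) ** 2
                         for row_r, row_s in zip(r, s) for a, b in zip(row_r, row_s)))

# Hypothetical 2 x 2 relations.
r = [[0.0, 0.5], [1.0, 0.25]]
s = [[0.0, 1.0], [0.25, 0.25]]

d1 = absolute_distance(r, s)
eucl = euclidean_distance(r, s)
print(d1, eucl)
```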



[Figure: screenshot of the program window, titled "Comparación del cierre T-Transitivo con nuevas técnicas de T-transitivización" (comparison of the T-transitive closure with new T-transitivization techniques). It shows the randomly generated relation, a histogram of distances, the Min-, Prod- and W-transitive closures and the T-transitivized matrices, each with its distances to the generated relation.]

Fig. 1. General front-end of the program



As an example, the program generates the following random fuzzy relation: 






[Figure: matrix of the generated random fuzzy relation; its entries are illegible in the source.]

Fig. 2. Example of generated random fuzzy relation

The program computes the Min-transitive closure, Prod-transitive closure and W-transitive
closure, measuring the absolute value distance and the Euclidean distance to the initial
fuzzy relation:

[Figure: three panels, "Cierre Min-transit." (Min-transitive closure), "Cierre Prod-transit." (Prod-transitive closure) and "Cierre W-transit." (W-transitive closure), each showing the closure matrix together with the number of iterations, the distances d1 and Euclidean to the initial relation, and the position of the maximum difference.]

Fig. 3. Example of Min-Transitive closure, Prod-transitive closure and W-transitive closure of 
the random fuzzy relation of Fig. 2, measuring the absolute value distance and euclidean 
distance with the initial fuzzy relation 



It also computes the Min-transitivized relation, Prod-transitivized relation and
W-transitivized relation (second row of relations in Fig. 1), and also measures their
distances to the same original fuzzy relation:

[Figure: three panels, "Matriz Min-Transitivizada", "Matriz Prod-Transitivizada" and "Matriz W-Transitivizada" (Min-, Prod- and W-transitivized matrices), each with its distances d1 and Euclidean to the initial relation, the position of the maximum difference and the number of reduced entries.]

Fig. 4. Example of Min-transitivized relation, Prod-transitivized relation and W-transitivized
relation of the random fuzzy relation of Fig. 2, measuring the absolute value distance and
Euclidean distance with the initial fuzzy relation

After repeating this process 100 times, the program shows the percentage of times that
the T-transitivized relation is closer to the random relation than the T-transitive
closure. For the minimum t-norm, in 85% of the tries the distance to the
Min-transitivized relation is lower than the distance to the Min-transitive closure. This
percentage is 53% when T is the product t-norm, and for the Lukasiewicz t-norm the
W-transitivized relation is closer than the W-transitive closure 84% of the time:

[Figure: program output showing these percentages.]

Fig. 5. After doing the Fig 2-3-4 process 101 times, the program shows the percentage of times
that the T-transitivized relation has a lower distance to the random relation than the
T-transitive closure has to the initial relation

The program can also be told to generate reflexive fuzzy relations; in that case
two Min-preorders (the Min-transitive closure and the Min-transitivized
relation), two Prod-preorders and two W-preorders are generated.

When reflexive and symmetric random fuzzy relations are generated, their
computed T-transitive closures are T-indistinguishabilities. The original
transitivization method described does not preserve symmetry, but we have already
developed a version that transitivizes a fuzzy relation while preserving symmetry (when
an element is reduced, its symmetric element is also reduced), thus obtaining
T-indistinguishabilities.

[Figure: buttons labelled "Genera Matriz" (generate matrix), "Cambiar Dimensión" (change dimension), "¿Reflexiva?" (reflexive?) and "¿Simétrica?" (symmetric?).]

Fig. 6. Buttons to start a new process, and to choose the properties of the generated fuzzy
relation, such as the dimension, the reflexive property and the symmetric property



[Figure: the "excel" button and the button that repeats the process fifty times.]

Fig. 7. The program has buttons to repeat the process fifty times and to keep the results in an
Excel document

The histogram shows the absolute value distance of the last randomly generated
fuzzy relation to (in this order, from left to right) the Min-transitive closure,
the Min-transitivized relation, the Prod-transitive closure, the Prod-transitivized
relation, the W-transitive closure and the W-transitivized relation. The graph at the
right of the picture compares the absolute value distances of both T-transitivization
methods for the t-norms (in this order, from the upper to the lower graph) minimum,
product and Lukasiewicz for the last hundred random fuzzy relations. In most
cases, the distance to the T-transitivized relation is lower than the distance to the
T-transitive closure for the three t-norms.

Fig. 8. The histogram shows the absolute value distance of the last randomly generated fuzzy
relation to the Min-transitive closure, the Min-transitivized relation, the Prod-transitive
closure, the Prod-transitivized relation, the W-transitive closure and the W-transitivized
relation. The graph at the right of the picture compares the absolute value distances of both
T-transitivization methods for the t-norms minimum, product and Lukasiewicz for the last
hundred random fuzzy relations



The program has been scheduled to generate one hundred random fuzzy
relations for each dimension from two to one hundred. The average distances for each
dimension have been saved in an Excel document.



4 Program Work 

The program has been run one hundred times for each dimension from two to one
hundred; that is, the program has generated 9900 random fuzzy relations, computing
their T-transitive closures and their T-transitivized relations for different t-norms, and
computing their average absolute value and Euclidean distances for each dimension.

The graph below represents, for each dimension, the average
absolute value distance to the W-transitive closure (the line of higher distances)
and to the W-transitivized relation. The appearance of the results could change when using
other distances, but it looks the same for the three t-norms used.




Fig. 9. Average of the absolute value distances of 100 random relations with their W-transitive 
closure and W-transitivized relation for each dimension from two to one hundred 

The functions fitted to those average distances for the minimum, product and
Lukasiewicz t-norms are the following:

Table 1. Interpolation function of the average absolute value distance of the W-transitive
closure and W-transitivized relation of one hundred random fuzzy relations for each dimension
from two to one hundred

Absolute value distance   Min                          Prod                        W
Transitive closure        y = 0.5x^2 + 1.19x - 16.27   y = 0.6x^2 - 3.4x + 5       y = 0.51x^2 + 0.49x
Transitivized relation    y = 0.47x^2 - 1.27x + 5.9    y = 0.47x^2 - 1.23x + 5.1   y = 0.46x^2 - 1.72x + 4.74



The average distance of the generated relations to the transitive closure is
higher than that to the transitivized relation for all dimensions and for all t-norms.

When using the Euclidean distance, higher distances are also obtained for the
T-transitive closure for the three t-norms, but the fitted functions are linear:

[Figure: plot titled "Min-transitive closure and Min-transitivized relation"; horizontal axis: fuzzy relation dimension.]



Fig. 10. Average of the euclidean distances of the Min-transitive closure and Min-transitivized 
relation of one hundred random fuzzy relations for each dimension from two to one hundred. 



The linear functions resulting when using euclidean distances are the following: 



Table 2. Interpolation function of the average euclidean distance of the T-transitive closure and 
T-transitivized relation of one hundred random fuzzy relations for each dimension from two to 
one hundred 



Euclidean distance        Min                Prod               W
Transitive closure        y = 0.61x - 0.42   y = 0.61x - 0.63   y = 0.61x - 0.68
Transitivized relation    y = 0.56x - 0.76   y = 0.56x - 0.77   y = 0.56x - 1.19
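To read the fitted lines of Table 2 side by side, the following Python sketch evaluates them at a given dimension and reports the gap between closure and transitivized relation. The coefficients are those of Table 2; the dictionary layout and function names are hypothetical.

```python
# (slope, intercept) of the fitted lines y = slope * x + intercept from Table 2.
FITS = {
    "min": {"closure": (0.61, -0.42), "transitivized": (0.56, -0.76)},
    "prod": {"closure": (0.61, -0.63), "transitivized": (0.56, -0.77)},
    "luk": {"closure": (0.61, -0.68), "transitivized": (0.56, -1.19)},
}

def fitted_distance(t_norm, kind, n):
    """Fitted average Euclidean distance for relations of dimension n."""
    slope, intercept = FITS[t_norm][kind]
    return slope * n + intercept

def gap(t_norm, n):
    """Fitted closure distance minus fitted transitivized distance."""
    return fitted_distance(t_norm, "closure", n) - fitted_distance(t_norm, "transitivized", n)

for name in FITS:
    print(name, round(gap(name, 10), 2), round(gap(name, 100), 2))
```

Since the closure slope exceeds the transitivized slope by 0.05 for every t-norm, the fitted gap grows linearly with the dimension.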



As the mean distances to the T-transitive closure are higher than the mean
distances to the T-transitivized relations, we have studied the difference. The graph
below shows the difference between the means using the absolute value distance
and the minimum t-norm, for dimensions from two to one hundred:

[Figure: plot titled "Differences between the average absolute value distance"; vertical axis from 0 to 600.]




Fig. 11. Differences between the average absolute value distance of the 100 generated relations 
with their Min-transitive closure and Min-transitivized relations, for dimensions from 2 to 100 

Some statistical values for those 9900 generated relations and their transitivized 
relations using the absolute value distance are the following: 



Table 3. Statistical values of all absolute value distances with the transitivized relations for the
9900 fuzzy relations generated

Absolute value        Minimum                  Product                  Lukasiewicz
distance              Transitive  Algorithm    Transitive  Algorithm    Transitive  Algorithm
                      closure                  closure                  closure

Mean (average)        1702        1492         1701        1491         1701        1456
Standard deviation    1515,47     1350,05      1516,42     1349,09      1516,60     1326,66
Second quartile       1296,3      1111,0       1296,3      1109,7       1296,3      1079,1
First quartile        335,4       279,0        334,3       278,5        334,2       264,0
Third quartile        2846,8      2504,9       2846,8      2502,7       2846,8      2448,8



Table 4. Statistical values of all Euclidean distances with the transitivized relations for the
9900 fuzzy relations generated

Euclidean             Minimum                  Product                  Lukasiewicz
distance              Transitive  Algorithm    Transitive  Algorithm    Transitive  Algorithm
                      closure                  closure                  closure

Mean (average)        30          27           30          27           30          27
Standard deviation    17,38       16,22        17,47       16,22        17,49       16,16
Second quartile       30,1        27,4         30,1        27,4         30,1        26,9
First quartile        15,2        13,6         15,1        13,6         15,1        13,1
Third quartile        44,6        41,3         44,6        41,3         44,6        40,7
5 Results Analysis

After generating 100 random fuzzy relations for every dimension from 2 to 100 and
computing their average distance to the T-transitive closure and to the
T-transitivized relation, we have seen that, for any distance, any t-norm and any
dimension, the T-transitivized relation is closer to the initial relations than the
T-transitive closure.

When global measures are obtained for the 9900 relations, the transitivized relation
is also closer than the transitive closure, and has lower dispersion.
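
The closure side of this experiment can be sketched as follows for the minimum
t-norm. This is only an illustration under stated assumptions: the function names
are hypothetical, the absolute value distance is taken to be the sum of absolute
entry-wise differences and the Euclidean distance the square root of the sum of
squared differences, and the T-transitivization algorithm itself is not reproduced.

```python
import math
import random


def min_transitive_closure(relation):
    """Return the max-min transitive closure of a square fuzzy relation."""
    n = len(relation)
    closure = [row[:] for row in relation]
    changed = True
    while changed:
        changed = False
        # Max-min composition of the current relation with itself.
        composed = [
            [max(min(closure[i][k], closure[k][j]) for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
        # Keep the element-wise maximum; stop when no entry grows any more.
        for i in range(n):
            for j in range(n):
                if composed[i][j] > closure[i][j]:
                    closure[i][j] = composed[i][j]
                    changed = True
    return closure


def absolute_value_distance(a, b):
    """Sum of absolute differences between corresponding entries (assumed)."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def euclidean_distance(a, b):
    """Square root of the sum of squared differences (assumed)."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def average_closure_distances(dimension, samples=100, seed=0):
    """Average both distances between random relations and their closures."""
    rng = random.Random(seed)
    abs_total = 0.0
    euc_total = 0.0
    for _ in range(samples):
        relation = [[rng.random() for _ in range(dimension)]
                    for _ in range(dimension)]
        closure = min_transitive_closure(relation)
        abs_total += absolute_value_distance(relation, closure)
        euc_total += euclidean_distance(relation, closure)
    return abs_total / samples, euc_total / samples
```

Calling `average_closure_distances(d)` for each dimension `d` gives one point per
dimension for the closure curves; the same loop could be reused with any
implementation of the transitivization algorithm.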

6 Conclusions

The T-transitivization algorithm gives T-transitive relations that are closer to the
initial relation than the T-transitive closure, for any dimension and any t-norm. The
two results also differ because the algorithm gives T-transitive relations contained
in the initial relation.

The T-transitive closure is uniquely defined, whereas several T-transitive relations
contained in the initial relation can be found.

It is proven [Garmendia & Salvador; 2000] that the T-transitivization algorithm
keeps reflexivity and a-reflexivity. The T-transitive closure keeps reflexivity, but
not a-reflexivity. However, the algorithm does not keep symmetry as the transitive
closure does, and so it does not produce T-indistinguishabilities from reflexive and
symmetric relations. We have already developed a new version that does keep it, by
also reducing the symmetric element of every reduced element.
Abstract. In this work we study the interpretation of some fuzzy integrals
(Choquet, Sugeno and twofold integrals). We give some examples of their use
and, from them, we study the meaning and interest of each integral. We show
that fuzzy inference systems, for both disjunctive and conjunctive rules, can
be interpreted in terms of Sugeno integrals. This opens a new field of
application for Sugeno integrals.

Keywords: Fuzzy integrals, Sugeno integral, Fuzzy inference system,
Twofold integral



1 Introduction 

Although fuzzy integrals have been studied for a long time and have proven to be
powerful operators, the number of working applications is still limited. One
cause of this limitation is that they require the definition of fuzzy measures,
and such measures need $2^n$ parameters (where $n$ is the number of information
sources). Thus, there is a curse of dimensionality. Another cause is the
difficulty of grasping the meaning of such measures and integrals. Without a
clear interpretation of fuzzy measures in mind, it is extremely difficult for an
expert to define a large number of values (subject to several constraints, so
that they correctly define a fuzzy measure). In this work we describe several
interpretations of fuzzy measures when used in conjunction with some fuzzy
integrals.

Among the existing fuzzy integrals, the best known are the Choquet [4] and the
Sugeno [11] integrals. Choquet integrals can be interpreted as a generalization
of the expectation to the case of a non-additive measure [8]. For a probability
distribution (an additive measure), the Choquet integral corresponds to the
expectation and, therefore, the contribution of a particular value to the
integral is just its probability. In the case of non-additive measures, the
contribution of a value $a$ is a function of the measure of all values larger
than $a$.

As this interpretation does not fit at all with the Sugeno integral, alternative
interpretations are needed. Nevertheless, the expression for the Sugeno integral
has several resemblances with the one for the Choquet integral. Both combine
values with a fuzzy measure. In the case of the Choquet integral, addition and
multiplication are used for this combination, whereas in the case of the Sugeno
integral, maximum and minimum are used. These resemblances were exploited
by Murofushi and Sugeno to define the t-conorm integral [7] (a generalization
of both Sugeno and Choquet integrals). In short, this integral generalizes both
maximum and addition in terms of a t-conorm, and both minimum and product
in terms of a t-norm-like operator.

V. Torra and Y. Narukawa (Eds.): MDAI 2004, LNAI 3131, pp. 316-327, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

In this work, we study the interpretation of the Sugeno integral and the 
Choquet integral and also of the twofold integral [13] that is a generalization of 
the former integrals. The structure of the paper is as follows. In Section 2, we 
review some definitions and results. Then, Section 3 gives interpretations of the 
Sugeno and the Choquet integrals. Section 4 studies the interpretation of the 
twofold integral. The paper finishes in Section 5 with some conclusions. 

2 Preliminaries 

In this section, we present some preliminary definitions and properties that are 
used in the rest of this paper. In particular, we will define fuzzy measures (on 
a finite universal set X) and the following fuzzy integrals: Choquet, Sugeno and 
twofold. Additionally, we will define the weighted minimum and the weighted 
maximum. See [8] for details on fuzzy measures and fuzzy integrals and [2] for 
a broader view of the field of aggregation operators. 

Definition 1. A set function $\mu : 2^X \to [0, 1]$ is a fuzzy measure if it
satisfies the following axioms:

(i) $\mu(\emptyset) = 0$, $\mu(X) = 1$ (boundary conditions)

(ii) $A \subseteq B$ implies $\mu(A) \le \mu(B)$ (monotonicity) for $A, B \in 2^X$

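Definition 2. Let $\mu$ be a fuzzy measure on $(X, 2^X)$. The Choquet integral
$C_\mu(f)$ of $f : X \to \mathbb{R}^+$ with respect to $\mu$ is defined by

\[ C_\mu(f) = \sum_{i=1}^{n} f(x_{s(i)}) \left( \mu(A_{s(i)}) - \mu(A_{s(i+1)}) \right), \]

where $f(x_{s(i)})$ indicates that the indices have been permuted so that
$0 \le f(x_{s(1)}) \le \cdots \le f(x_{s(n)})$, $A_{s(i)} = \{x_{s(i)}, \ldots, x_{s(n)}\}$
and $A_{s(n+1)} = \emptyset$.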
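
As a hedged illustration of Definition 2, the following sketch computes the
Choquet integral on a finite set and checks that, for an additive measure, it
coincides with the expectation. The function name, the dictionary-based
representation and the numeric data are assumptions made only for this example.

```python
def choquet_integral(f, mu):
    """Choquet integral of f (dict: element -> value >= 0) with respect to mu.

    mu is a callable that takes a frozenset of elements and returns its measure.
    """
    order = sorted(f, key=f.get)            # x_s(1), ..., x_s(n), ascending in f
    total = 0.0
    for i, x in enumerate(order):
        a_i = frozenset(order[i:])          # A_s(i) = {x_s(i), ..., x_s(n)}
        a_next = frozenset(order[i + 1:])   # A_s(i+1); empty after the last index
        total += f[x] * (mu(a_i) - mu(a_next))
    return total


# Hypothetical data: an additive measure induced by a probability distribution.
probabilities = {"x1": 0.2, "x2": 0.5, "x3": 0.3}
values = {"x1": 10.0, "x2": 4.0, "x3": 7.0}


def additive_measure(subset):
    """Measure of a set as the sum of the probabilities of its elements."""
    return sum(probabilities[x] for x in subset)


expectation = sum(probabilities[x] * values[x] for x in values)
choquet = choquet_integral(values, additive_measure)
```

Here `choquet` and `expectation` agree up to rounding; replacing
`additive_measure` by a non-additive monotone set function is the only change
needed to obtain a proper Choquet integral.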

Definition 3. [11] The Sugeno integral $SI_\mu(f)$ of a function $f : X \to [0, 1]$
with respect to $\mu$ is defined by

\[ SI_\mu(f) = \bigvee_{i=1}^{n} \left( f(x_{s(i)}) \wedge \mu(A_{s(i)}) \right), \]

where $f(x_{s(i)})$ indicates that the indices have been permuted so that
$0 \le f(x_{s(1)}) \le \cdots \le f(x_{s(n)}) \le 1$,
$A_{s(i)} = \{x_{s(i)}, \ldots, x_{s(n)}\}$ and $A_{s(n+1)} = \emptyset$.
Definition 4. [13] [10] Let $\mu_C$ and $\mu_S$ be two fuzzy measures on $X$. Then
the twofold integral of a function $f : X \to [0, 1]$ with respect to the fuzzy
measures $\mu_S$ and $\mu_C$ is defined by:

\[ TI_{\mu_S,\mu_C}(f) = \sum_{i=1}^{n} \left( \left( \bigvee_{j=1}^{i} f(x_{s(j)}) \wedge \mu_S(A_{s(j)}) \right) \left( \mu_C(A_{s(i)}) - \mu_C(A_{s(i+1)}) \right) \right), \]

where $f(x_{s(i)})$ indicates that the indices have been permuted so that
$0 \le f(x_{s(1)}) \le \cdots \le f(x_{s(n)}) \le 1$,
$A_{s(i)} = \{x_{s(i)}, \ldots, x_{s(n)}\}$ and $A_{s(n+1)} = \emptyset$.
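
The following sketch evaluates the twofold integral, reading Definition 4 as a
sum over $i$ of the running maximum of $f(x_{s(j)}) \wedge \mu_S(A_{s(j)})$ for
$j \le i$, weighted by $\mu_C(A_{s(i)}) - \mu_C(A_{s(i+1)})$. The names and the
three example measures are hypothetical. With this reading, a measure $\mu_C$
equal to 1 on every non-empty set leaves only the Sugeno part, and a measure
$\mu_S$ equal to 1 on every non-empty set leaves only the Choquet part.

```python
def twofold_integral(f, mu_s, mu_c):
    """Twofold integral of f (dict: element -> value in [0, 1]).

    mu_s and mu_c are callables taking a frozenset and returning its measure.
    """
    order = sorted(f, key=f.get)            # ascending values of f
    total = 0.0
    running_max = 0.0                       # max over j <= i of f ^ mu_s
    for i, x in enumerate(order):
        a_i = frozenset(order[i:])
        a_next = frozenset(order[i + 1:])
        running_max = max(running_max, min(f[x], mu_s(a_i)))
        total += running_max * (mu_c(a_i) - mu_c(a_next))
    return total


def total_measure(subset):
    """Measure equal to 1 on every non-empty set and 0 on the empty set."""
    return 1.0 if subset else 0.0


# Hypothetical example data.
f = {"x1": 0.3, "x2": 0.7, "x3": 0.6}
weights = {"x1": 0.2, "x2": 0.5, "x3": 0.3}


def additive_measure(subset):
    """Additive measure built from the hypothetical weights."""
    return sum(weights[x] for x in subset)


def cardinality_measure(subset):
    """Monotone, non-additive measure that depends only on the set size."""
    return (len(subset) / len(f)) ** 2


choquet_like = twofold_integral(f, total_measure, additive_measure)
sugeno_like = twofold_integral(f, cardinality_measure, total_measure)
```

In this example `choquet_like` is the weighted mean of `f` under `weights`, and
`sugeno_like` is the Sugeno integral of `f` with respect to
`cardinality_measure`.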

Definition 5. A vector $v = (v_1, \ldots, v_N)$ is a possibility distribution or a
possibilistic weighting vector of dimension $N$ if and only if $v_i \in [0, 1]$ and
$\max_i v_i = 1$.

Definition 6. [5] Let $u$ be a weighting vector of dimension $N$; then a mapping
WMin: $[0, 1]^N \to [0, 1]$ is a weighted minimum of dimension $N$ if

\[ WMin_u(a_1, \ldots, a_N) = \min_i \max(1 - u_i, a_i). \]

Definition 7. [5] Let $u$ be a weighting vector of dimension $N$; then a mapping
WMax: $[0, 1]^N \to [0, 1]$ is a weighted maximum of dimension $N$ if

\[ WMax_u(a_1, \ldots, a_N) = \max_i \min(u_i, a_i). \]

3 Semantic Interpretation 

In this section, we briefly review an interpretation of the Choquet integral and
then turn to those of the Sugeno integral.



3.1 Interpretation of the Choquet Integral

It is well known that the Choquet integral generalizes the weighted mean. On
this basis, and taking into account that the weighted mean can be understood as
an expected value, the Choquet integral can be interpreted as a kind of
expectation. The difference from the weighted mean lies in the fact that, instead
of a probability distribution, a kind of probability-like measure is used. Such a
probability-like measure is due to some uncertainty about the probability itself.

3.2 Interpretation of the Sugeno Integral 

To interpret the Sugeno integral, let us start by considering the integral in an
ordinal setting. In this case, it is clear that, given a set $X$, both $f(x)$ (for
$x \in X$) and $\mu(A)$ (for $A \subseteq X$) should be in the same domain $D$.
Otherwise, the integral cannot be applied because the minimum cannot be applied
to $f(x)$ and $\mu(A)$. So, in some sense, both $\mu$ and $f$ should denote the
same concept. As $\mu$ denotes some importance, reliability, satisfaction or a
similar concept, the same should apply to $f$. Accordingly, the Sugeno integral
combines values of, e.g., importance or reliability, leading to another value of
importance or reliability. In fact, there are some applications of the Sugeno
integral in the literature (e.g. [11]) that fit with this perspective.

Table 1. Fuzzy measure for the traveler example

set      $\{x_1\}$  $\{x_2\}$  $\{x_3\}$  $\{x_1,x_2\}$  $\{x_2,x_3\}$  $\{x_1,x_3\}$  $X$
$\mu$    0.7        0.5        0.2        0.9            0.6            0.8            1

We illustrate this situation with the following scenario related to reliability.
That is, the domain $D$ stands for some kind of reliability. Let us consider some
experts $X = \{x_1, x_2, \ldots, x_n\}$. These experts evaluate the reliability of
a given machine, say a new Japanese copy machine. Then, $f(x_i)$ is the
reliability of the copy machine according to expert $x_i$. We also consider the
reliability of subsets of experts, so $\mu(A)$ is the reliability of the experts
in $A$ taken all together. Any ordered set $D$ is appropriate in this case to
express the reliabilities, and to apply the Sugeno integral we use the same set
$D$ to express the reliability of experts and the reliability of copy machines
(i.e., for all $A \subseteq X$ we have $\mu(A) \in D$ and for all $x_i \in X$ we
have $f(x_i) \in D$).

For example, let $X = \{x_1, x_2, x_3\}$ be a set of 3 experts. With
$\mu(\{x_1\}) = 0.2$, $\mu(\{x_2\}) = 0.3$ and $\mu(\{x_3\}) = 0.4$ we express that
expert $x_3$ is more reliable than $x_2$ and that $x_2$ is more reliable than
$x_1$. Let $\mu(\{x_1, x_2\}) = 0.3$ and $\mu(\{x_1, x_3\}) = 0.4$ represent that
joining $x_1$ to $x_2$ or to $x_3$ does not imply a larger reliability than that
of $x_2$ or $x_3$ alone. Instead, with $\mu(\{x_2, x_3\}) = 0.8$ we express that
when $x_2$ and $x_3$ are joined their reliability is greatly increased. Finally,
we set $\mu(\emptyset) = 0$ and $\mu(X) = 1$, following the boundary conditions.
Now, let $f(x_1) = 0.3$, $f(x_2) = 0.7$ and $f(x_3) = 0.6$ be the experts'
opinions on the reliability of the copy machine. In this case, the Sugeno
integral leads to 0.6, the value of one of the most relevant experts.
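
The value 0.6 can be checked with a short sketch of the Sugeno integral applied
to the measure and opinions of this example. The function name and the
dictionary encoding of the measure are assumptions of the sketch; the numbers
are those given in the text.

```python
def sugeno_integral(f, mu):
    """Sugeno integral of f (dict: element -> value) w.r.t. mu (dict on frozensets)."""
    order = sorted(f, key=f.get)    # experts ordered by increasing opinion
    # max over i of min(f(x_s(i)), mu(A_s(i))), with A_s(i) = {x_s(i), ..., x_s(n)}
    return max(
        min(f[x], mu[frozenset(order[i:])]) for i, x in enumerate(order)
    )


# Reliability of each set of experts, as given in the example.
MU = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.2,
    frozenset({"x2"}): 0.3,
    frozenset({"x3"}): 0.4,
    frozenset({"x1", "x2"}): 0.3,
    frozenset({"x1", "x3"}): 0.4,
    frozenset({"x2", "x3"}): 0.8,
    frozenset({"x1", "x2", "x3"}): 1.0,
}

# Opinions of the experts on the reliability of the copy machine.
F = {"x1": 0.3, "x2": 0.7, "x3": 0.6}

result = sugeno_integral(F, MU)
```

The three candidate terms are $0.3 \wedge 1$, $0.6 \wedge 0.8$ and
$0.7 \wedge 0.3$, so the maximum is reached at expert $x_3$ together with the
highly reliable pair $\{x_2, x_3\}$.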

We illustrate the interpretation outlined above with another example corre- 
sponding to alternative selection. 



Sugeno Integral for Alternative Selection. Let us consider a traveler in
Japan who intends to visit Tokyo, Kyoto and Nagano and considers several
alternative places for staying. Let $X = \{x_1, x_2, x_3\}$ denote the three
mentioned cities. That is, $x_1$ corresponds to Tokyo, $x_2$ to Kyoto and $x_3$
to Nagano. Then, we consider the degree of satisfaction of the traveler when
visiting such cities. This degree is expressed with the fuzzy measure $\mu(A)$
described in Table 1.

Now, let us consider the accessibility of these towns when the traveler is
located at Tsukuba. In this case, Tokyo is the most accessible city, then Nagano
and finally Kyoto. Table 2 gives measures of such accessibility from Tsukuba.
These measures are expressed in the same terms as the degree of satisfaction
$\mu$. That is, $f(x)$ is comparable with $\mu(A)$.
Table 2. Accessibility degrees from Tsukuba

set     $x_1$   $x_2$   $x_3$
$f$     0.8     0.4     0.5

Table 3. Satisfaction degree for each city for the traveler example

set       $x_1$   $x_2$   $x_3$
$\mu_f$   0.7     1       0.8

Now, we define $\mu_f(x_i) := \mu(\{x \mid f(x) \ge f(x_i)\})$. This expression can
be understood as the degree of satisfaction of visiting $x_i$ and all those
cities that are at least as accessible as $x_i$. Roughly speaking, as far as
accessibility is concerned, if we visit a place $x_i$ with a given accessibility
$f(x_i)$, then we assume that all places with a greater accessibility are also
visited. So, we consider each $f(x_i)$ as a threshold for selecting the visits.
Therefore, if we visit $x_i$ we will visit as well $\{x \mid f(x) \ge f(x_i)\}$,
and $\mu_f(x_i)$ is the degree of satisfaction of visiting such $x_i$. Table 3
gives the values for the example considered above.

The next step is to consider, for each city $x_i$, both the degree of
accessibility $f(x_i)$ and the degree of satisfaction $\mu_f(x_i)$. When
$f(x_i) > \mu_f(x_i)$, the traveler gives more weight to satisfaction, because
$x_i$ is easy to access; that is, the degree of $x_i$ cannot be larger than
$\mu_f(x_i)$. When $\mu_f(x_i) > f(x_i)$, the traveler must attach special
importance to physical accessibility; that is, the degree of $x_i$ corresponds
to $f(x_i)$. Accordingly, both degrees are combined by means of the $\wedge$
(minimum) operator. Thus, $f(x_i) \wedge \mu_f(x_i)$ is the evaluation of going
to $x_i$.

Thus, the place $x_i$ with the largest evaluation $f(x_i) \wedge \mu_f(x_i)$
stands for the evaluation of staying in Tsukuba. This largest evaluation
corresponds to the Sugeno integral of $f$ with respect to $\mu$, which in this
example is equal to $SI_\mu(f) = \max_{x_i} f(x_i) \wedge \mu_f(x_i) = 0.7$.

When another alternative is considered, the function $f$ changes but the
process is analogous. In this way, if the traveler stays in Osaka, the measure
$\mu$ in Table 1 is still valid but an alternative function $g$ is required.
This function measures the accessibility of going to $x_i$ from Osaka. Table 4
displays this function. In this case, the Sugeno integral leads to
$SI_\mu(g) = 0.6$ (the function $\mu_g$ is given in Table 5).

As $SI_\mu(g) < SI_\mu(f)$, the traveler will choose to stay in Tsukuba
instead of staying in Osaka.
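
The comparison between the two bases can be reproduced with the threshold form
of the Sugeno integral used in this example. In the sketch below the city names
replace $x_1$, $x_2$, $x_3$, the function names are hypothetical, and the
accessibility of Tokyo from Tsukuba is taken to be 0.8, the value that is
consistent with the satisfaction degrees of Table 3 and with the result 0.7.

```python
# Degree of satisfaction of visiting each set of cities (fuzzy measure of Table 1).
MU = {
    frozenset({"Tokyo"}): 0.7,
    frozenset({"Kyoto"}): 0.5,
    frozenset({"Nagano"}): 0.2,
    frozenset({"Tokyo", "Kyoto"}): 0.9,
    frozenset({"Kyoto", "Nagano"}): 0.6,
    frozenset({"Tokyo", "Nagano"}): 0.8,
    frozenset({"Tokyo", "Kyoto", "Nagano"}): 1.0,
}

# Accessibility of each city from each candidate base (assumed encoding).
ACCESSIBILITY = {
    "Tsukuba": {"Tokyo": 0.8, "Kyoto": 0.4, "Nagano": 0.5},
    "Osaka": {"Tokyo": 0.4, "Kyoto": 1.0, "Nagano": 0.7},
}


def satisfaction_at_threshold(f, city):
    """mu_f(city): satisfaction of visiting every city at least as accessible."""
    return MU[frozenset(c for c in f if f[c] >= f[city])]


def evaluate_base(f):
    """Sugeno integral of f w.r.t. MU as max over cities of f(x) ^ mu_f(x)."""
    return max(min(f[c], satisfaction_at_threshold(f, c)) for c in f)


evaluations = {base: evaluate_base(f) for base, f in ACCESSIBILITY.items()}
best_base = max(evaluations, key=evaluations.get)
```

With these inputs `evaluations` holds 0.7 for Tsukuba and 0.6 for Osaka, so
`best_base` is Tsukuba, in agreement with the conclusion above.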



Table 4. Accessibility degrees from Osaka

set     $x_1$   $x_2$   $x_3$
$g$     0.4     1       0.7

Table 5. Satisfaction degree for each city for the traveler example when staying in
Osaka

set       $x_1$   $x_2$   $x_3$
$\mu_g$   1       0.5     0.6



Sugeno Integral for Fuzzy Inference Systems. Now, we consider a different
scenario where certainty degrees play the central role. Let us consider rules (or,
in general, any knowledge based system) that assign such degrees to a particular
output value. E.g., a rule of the form

$R_i$: If $x$ is $A_i$ then $y$ is $B_i$

assigns a certainty degree of 0.9 to $y$ being 5 when $x$ is $x_0$. In this case,
when sets of rules are considered and they conclude on the same output value, the
degrees should be somehow aggregated. When $X$ denotes the set of rules that
conclude about such output value (5 in the example), $f(x_i)$ denotes the
certainty degree that rule $x_i$ assigns to 5, and $\mu(A)$ is the certainty of
the set $A$ of rules. In such a situation, the Sugeno integral could be used to
compute the certainty of the value 5.

This latter scenario is consistent, as will be shown below, with fuzzy inference,
and links inference with the weighted minimum and the weighted maximum, and the
latter operators with the Sugeno integral (one of their generalizations).

The next two sections describe fuzzy inference. First we consider the case
of disjunctive rules and then the case of conjunctive rules. As we see it, the
examples considered validate our interpretation of the Sugeno integral. The use
of the Sugeno integral for disjunctive rules was previously suggested in [12].



The Case of Disjunctive Rules. In this section, we consider the application
of a fuzzy inference system defined in terms of several disjunctive rules.

Example 1. Let us consider a fuzzy inference system FIS defined in terms of 6
disjunctive rules. For the sake of simplicity, we consider a single input
variable $x$ described in a given domain $X$ and a single output variable $y$
described in a domain $Y$. Then, to define the system we need a set of fuzzy sets
$A_i$ on $X$ and $B_i$ on $Y$. Naturally, $A_i \subseteq X$ and
$B_i \subseteq Y$ for all $i = 1, \ldots, 6$. Using the fuzzy sets $A_i$ and
$B_i$ we define the following set of rules:

$R_1$: If $x$ is $A_1$ then $y$ is $B_1$
$R_2$: If $x$ is $A_2$ then $y$ is $B_2$
$R_3$: If $x$ is $A_3$ then $y$ is $B_3$
$R_4$: If $x$ is $A_4$ then $y$ is $B_4$
$R_5$: If $x$ is $A_5$ then $y$ is $B_5$
$R_6$: If $x$ is $A_6$ then $y$ is $B_6$

Given such a system, the output of the system for a given input $x_0$ is
computed as the combination of the outputs of each rule. All rules are fired and
the degree of satisfaction of each antecedent is computed. This corresponds to
computing $\alpha_i$, where $\alpha_i$ is the degree of satisfaction of
"$x_0$ is $A_i$". In our case, as there is a single condition in the antecedent,
$\alpha_i = A_i(x_0)$. Once $\alpha_i$ is known, the conclusion of rule $R_i$ can
be computed.

For systems defined in terms of disjunctive rules, the output of a rule is often
computed using Mamdani's approach. Mamdani's approach is equivalent to
computing the output for $A' = \{x_0\}$ as either $\cup_j (A' \circ R_j)$ or
$A' \circ (\cup_j R_j)$, with $\circ$ being a max-min composition and $R_j$ being
the intersection of $A_j$ and $B_j$. See e.g. [6] for a proof. We use Mamdani's
approach computing $\cup_j (A' \circ R_j)$.

From an operational point of view, Mamdani's approach is as follows: for
each rule $R_i$, its output fuzzy set $B_i$ is clipped according to the degree of
satisfaction $\alpha_i$. Accordingly, the output of such a fuzzy rule $R_i$ is
$B_i \wedge A_i(x_0)$.

Then, the procedure follows with the union of all the outputs. That is, the
fuzzy output $B$ (for the whole system) is computed as the union of the outputs
of each rule $R_i$. Using maximum for the union (the most usual operator), the
output $B$ becomes:

\[ B = \bigvee_{i=1}^{6} \left( B_i \wedge A_i(x_0) \right) \]

Finally, the output fuzzy set $B$ is usually defuzzified. In what follows, we skip
the defuzzification stage as it is not relevant for our study.

Let us now consider the membership of the fuzzy output $B$ for a given element
$y_0$ in $Y$, that is, the value of $B(y_0)$. It corresponds to:

\[ B(y_0) = \bigvee_{i=1}^{6} \left( B_i(y_0) \wedge A_i(x_0) \right) \]

Such an expression can be seen in the light of the weighted maximum. This
operator, recalled in Section 2, is defined as:

\[ WMax_u(a_1, \ldots, a_N) = \max_i \min(u_i, a_i) \]

Therefore,

\[ B(y_0) = WMax_u(B_1(y_0), \ldots, B_N(y_0)) \qquad (2) \]

where the weighting vector is $u = (A_1(x_0), \ldots, A_N(x_0))$ or, in general,
$u = (\alpha_1, \ldots, \alpha_N)$. Note that the weighting vector is independent
of the value $y_0$; thus, for a given $x_0$, we are applying the same aggregation
method with the same parameterization $u$ for all the $y_0$ in $Y$.
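
A minimal sketch of this correspondence is given below. It clips each rule
output at its firing degree, takes the maximum, and compares the result with the
weighted maximum of the consequent memberships. The triangular membership
functions, the three-rule base and all names are hypothetical choices made for
the illustration.

```python
def triangular(left, peak, right):
    """Return a triangular membership function (hypothetical shape)."""
    def membership(v):
        if v <= left or v >= right:
            return 0.0
        if v <= peak:
            return (v - left) / (peak - left)
        return (right - v) / (right - peak)
    return membership


def weighted_max(u, a):
    """WMax_u(a_1, ..., a_N) = max_i min(u_i, a_i)."""
    return max(min(ui, ai) for ui, ai in zip(u, a))


def mamdani_membership(antecedents, consequents, x0, y0):
    """Clip every consequent at its firing degree, then take the union (max)."""
    clipped = [min(b(y0), a(x0)) for a, b in zip(antecedents, consequents)]
    return max(clipped)


# Hypothetical rule base with three rules "If x is A_i then y is B_i".
antecedents = [triangular(0, 2, 4), triangular(2, 4, 6), triangular(4, 6, 8)]
consequents = [triangular(0, 1, 2), triangular(1, 2, 3), triangular(2, 3, 4)]

x0 = 3.0
u = [a(x0) for a in antecedents]    # firing degrees A_i(x0), fixed for all y0

# The same weighting vector u is reused for every output value y0.
outputs = {
    y0: weighted_max(u, [b(y0) for b in consequents])
    for y0 in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
}
```

Note that in this rule base no antecedent is fully satisfied at `x0`, so the
largest entry of `u` is below 1.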



The Case of Conjunctive Rules. In the case of a set $\{R_j\}_j$ of conjunctive
rules, there are two alternative expressions to compute the output for an input
fuzzy set $A'$. Such expressions are $\cap_j (A' \circ R_j)$ and
$A' \circ (\cap_j R_j)$. Although these two expressions do not lead, in general,
to the same output, they are equal when $A'$ is a single value (our case). Due to
this, we will use $\cap_j (A' \circ R_j)$ because it is more appropriate for
illustration.

Example 2. Let us consider again the set of rules given above in Example 1 and
their application when the input is $x_0$. Then, from the operational point of
view, we need to compute $A' \circ R_j$ for each rule $R_j$ and then the
intersection of all these outputs. Using minimum (denoted $\wedge$) for the
intersection, we have:

\[ B = \bigwedge_{i=1}^{6} (A' \circ R_i) \]

or, alternatively, for all $y_0 \in Y$

\[ B(y_0) = \bigwedge_{i=1}^{6} \left( (A' \circ R_i)(y_0) \right) \]

where $R_i$ is the relation built from $A_i$ and $B_i$ using an implication
function $I$ (see [6] for details on implication functions and for a description
of several families). That is, $R_i = I(A_i, B_i)$. However, as $A' = \{x_0\}$,
we have that $A' \circ R_i$ corresponds to $I(A_i(x_0), B_i(y))$ and then we can
write:

\[ B(y_0) = \bigwedge_{i=1}^{6} \left( I(A_i(x_0), B_i(y_0)) \right) \]

In the light of the weighted minimum (see Definition 6):

\[ WMin_u(a_1, \ldots, a_N) = \min_i \max(1 - u_i, a_i) \]

we can select an appropriate $I$ so that the expression above for $B(y_0)$ is
expressed in terms of a weighted minimum. In particular, the Kleene-Dienes
implication $I(a, b) = \max(1 - a, b)$ makes this correspondence possible:

\[ B(y_0) = \bigwedge_{i=1}^{6} \left( I(A_i(x_0), B_i(y_0)) \right) = \bigwedge_{i=1}^{6} \max(1 - A_i(x_0), B_i(y_0)) \qquad (3) \]

as follows:

\[ B(y_0) = WMin_u(B_1(y_0), \ldots, B_N(y_0)) \qquad (4) \]

where $u = (A_1(x_0), \ldots, A_N(x_0))$.
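
The two steps of Example 2 can be illustrated on small discrete domains. The
sketch builds each relation $R_i$ with the Kleene-Dienes implication, composes it
with the singleton input by max-min composition, intersects the results with the
minimum, and compares the outcome with the weighted minimum. The domains, the
two-rule base and the function names are assumptions of the sketch.

```python
def kleene_dienes(a, b):
    """Kleene-Dienes implication I(a, b) = max(1 - a, b)."""
    return max(1.0 - a, b)


def weighted_min(u, a):
    """WMin_u(a_1, ..., a_N) = min_i max(1 - u_i, a_i)."""
    return min(max(1.0 - ui, ai) for ui, ai in zip(u, a))


def compose_singleton(x0, domain, relation, y):
    """(A' o R)(y) by max-min composition, for the crisp singleton A' = {x0}."""
    return max(min(1.0 if x == x0 else 0.0, relation(x, y)) for x in domain)


# Hypothetical discrete domains and a two-rule base "If x is A_i then y is B_i".
X_DOMAIN = [0, 1, 2]
Y_DOMAIN = ["low", "high"]
A = [{0: 1.0, 1: 0.8, 2: 0.0}, {0: 0.0, 1: 0.3, 2: 1.0}]
B = [{"low": 0.6, "high": 0.1}, {"low": 0.2, "high": 0.9}]


def conjunctive_membership(x0, y0):
    """B(y0): intersection (min) over the rules of (A' o R_i)(y0)."""
    outputs = []
    for a_i, b_i in zip(A, B):
        # R_i(x, y) = I(A_i(x), B_i(y)); defaults bind the current rule.
        def relation(x, y, a_i=a_i, b_i=b_i):
            return kleene_dienes(a_i[x], b_i[y])
        outputs.append(compose_singleton(x0, X_DOMAIN, relation, y0))
    return min(outputs)


x0 = 1
u = [a_i[x0] for a_i in A]          # weighting vector (A_1(x0), A_2(x0))
via_wmin = {y0: weighted_min(u, [b_i[y0] for b_i in B]) for y0 in Y_DOMAIN}
```

For every `y0` in `Y_DOMAIN`, `conjunctive_membership(x0, y0)` coincides with
`via_wmin[y0]`, which is the content of Expression (4) for this toy rule base.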



Using the Sugeno Integral in the Fuzzy Inference System. According
to what has been deduced above, inference systems for both conjunctive and
disjunctive rules can be formalized in terms of aggregation operators. Moreover,
as the Sugeno integral generalizes both WMin and WMax, the output of such
inference systems can be understood in both cases as an integration of the
values $B_i(y_0)$ with respect to a fuzzy measure built from
$u = (A_1(x_0), \ldots, A_N(x_0))$.

Let $\mu_u^{wmax}$ and $\mu_u^{wmin}$ be fuzzy measures with
$\mu_u^{wmax}(Z) = \max_{i \in Z} u_i$ and
$\mu_u^{wmin}(Z) = 1 - \max_{i \notin Z} u_i$, where $Z \subseteq X$ and
$X := \{1, 2, \ldots, N\}$. Since $WMax_u(f) = SI_{\mu_u^{wmax}}(f)$ and
$WMin_u(f) = SI_{\mu_u^{wmin}}(f)$, we can rewrite Expressions 2 and 4 as follows:

\[ B(y_0) = SI_{\mu_u^{wmax}}(B_1(y_0), \ldots, B_N(y_0)) \qquad (5) \]

and, respectively:

\[ B(y_0) = SI_{\mu_u^{wmin}}(B_1(y_0), \ldots, B_N(y_0)) \qquad (6) \]

where $u = (A_1(x_0), \ldots, A_N(x_0))$.
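
The two identities can be checked numerically with the sketch below, which
builds both set functions from a weighting vector and evaluates the Sugeno
integral in the max-min form of Definition 3. The sketch uses a weighting vector
whose largest entry is 1, so that both set functions satisfy the boundary
conditions of Definition 1; that choice, the numeric values and the function
names are assumptions of this illustration.

```python
def weighted_max(u, a):
    """WMax_u(a_1, ..., a_N) = max_i min(u_i, a_i)."""
    return max(min(ui, ai) for ui, ai in zip(u, a))


def weighted_min(u, a):
    """WMin_u(a_1, ..., a_N) = min_i max(1 - u_i, a_i)."""
    return min(max(1.0 - ui, ai) for ui, ai in zip(u, a))


def possibility_measure(u):
    """Set function Z -> max of u_i over i in Z (0 on the empty set)."""
    return lambda subset: max((u[i] for i in subset), default=0.0)


def necessity_measure(u):
    """Set function Z -> 1 - max of u_i over i outside Z."""
    n = len(u)
    return lambda subset: 1.0 - max(
        (u[i] for i in range(n) if i not in subset), default=0.0
    )


def sugeno_integral(values, mu):
    """Sugeno integral of a list of values w.r.t. mu (callable on frozensets)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return max(
        min(values[i], mu(frozenset(order[k:]))) for k, i in enumerate(order)
    )


# Hypothetical firing degrees (largest entry equal to 1) and rule conclusions.
u = [1.0, 0.4, 0.7]
b = [0.3, 0.9, 0.6]

disjunctive = sugeno_integral(b, possibility_measure(u))
conjunctive = sugeno_integral(b, necessity_measure(u))
```

For these inputs `disjunctive` equals `weighted_max(u, b)` and `conjunctive`
equals `weighted_min(u, b)`, with the conjunctive value not exceeding the
disjunctive one.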

Interim Conclusions for the Sugeno Integral. The consequence of this
formalization is that both conjunctive and disjunctive fuzzy rule based systems
are expressed in terms of the Sugeno integral. The sole difference between
the two approaches is the definition (from $u$) of the fuzzy measure. In the case
of disjunctive rules, a possibility measure ($\mu_u^{wmax}$) is used, while in the
case of conjunctive rules a necessity measure ($\mu_u^{wmin}$) is used.

Moreover, as it is known that $\cap_j (A' \circ R_j) \subseteq \cup_j (A' \circ R_j)$,
it is easy to see that the two Sugeno integrals defined above (or, more
precisely, the two fuzzy measures $\mu_u^{wmax}$ and $\mu_u^{wmin}$ in conjunction
with the Sugeno integral) define an interval. It is clear that the use of other
fuzzy measures would lead to other values for the certainty degree (in or around
the interval).

An important aspect that cannot be skipped is that the weighting vectors
used above are not possibilistic weighting vectors (or possibility distributions).
This is so because, in general, $u$ does not satisfy $\max_i u_i = 1$, as it is
often the case that there is no $i$ such that $A_i(x_0) = 1$. The practical
consequence of this fact is that the aggregation operator does not satisfy
unanimity, $C(a, a, \ldots, a) = a$.

The rewriting of the fuzzy inference system in terms of Sugeno integrals has
an important consequence. While WMin and WMax assume independence
between the values to be aggregated, the Sugeno integral does not require such
independence. Therefore, the Sugeno integral is a natural operator to combine
the conclusions of several rules in a fuzzy rule based system when such rules are
not independent. Figure 1 illustrates this situation.

Figure 1 represents (left) the case of a fuzzy rule based system with rules on
two variables $X$ and $Y$ with a grid-like structure, and (right) a similar system
with a non-homogeneous structure. Note that in the case represented on the left
hand side of the figure, for almost any pair of input values $(x, y)$, four rules
are applied. Exceptions correspond to the knots or the lines in the figure, where
only one or two fuzzy rules are applied. Instead, in the case represented on the
right hand side, there is a region (around $x = 2$, $y = 2$) where the number of
rules depends on the values $(x, y)$. Such a region is marked in both situations.

Due to the regularity in the former case, a Sugeno integral with fuzzy measures
$\mu_u^{wmax}$ or $\mu_u^{wmin}$ is adequate for combining the outcomes of the
rules. In fact, this is so because the rules are independent (being applied in
different subdomains). Instead, in the second case, when rules are not
independent, other measures might be used to take into account the interaction
between the rules. In such a situation, when rules are combined using the weighted
maximum and the weighted minimum, the output might be biased towards the outcomes
of rules competing on similar input values.

Nevertheless, for the approach to be effective, fuzzy measures for the latter
situation should be automatically defined from some previous knowledge and
the values $A_i(x_0)$. This would be similar to the definition of $\mu_u^{wmax}$
and $\mu_u^{wmin}$ from $A_i(x_0)$.

Fig. 1. Graphical representation of two different fuzzy inference systems with two
input variables: memberships are given, and regions correspond to fuzzy rules



Interpretation. According to what has been said, in the Sugeno integral both
the measure and the values being aggregated are in the same domain. Such
values can be interpreted as importances, reliabilities or certainties.

From an operational perspective, the Sugeno integral can be considered to
proceed by "saturation". It selects the importance that overcomes (saturates) a
certain degree or threshold. In fact, as the threshold decreases while the inputs
increase, it finds a tradeoff (or compromise) between the importance (or
reliability or certainty degree) of the set and the importance that the members
of the set have assigned. This follows from the graphical interpretation of the
integral (see, e.g., [14]).

4 Interpretation of the Twofold Integral 

Twofold integrals correspond to two-step fuzzy integrals: a Choquet integral of
Sugeno integrals. See [9], [10] and [13] for details. Accordingly, such integrals
can be studied in terms of the properties of the Choquet integral and the Sugeno
integral.

Let us consider the application of the twofold integral to the function $f$ on
$X$ with respect to $\mu_S$ and $\mu_C$. In this case, following the
interpretations of the Sugeno integral given in Section 3, $\mu_S$ and $f$ should
both be in the same domain and measure a kind of importance and certainty. Then,
$\mu_C$ can be used to measure a kind of randomness.

Turning to the example of the rule based system, we can use the twofold integral
to define a fuzzy inference system with randomness on the rules.

Example 3. Let us consider a rule based fuzzy inference system. Let $B_i(y_0)$ be
the certainties that rules $R_i$ assign to a particular value $y_0$. Then,
$\mu_S(A)$ is the certainty assigned to the set of rules $A$. Naturally,
$\mu_S(A)$ is computed from $A_i(x_0)$ (the degree to which rules $R_i$ have been
fired), either using $\mu_u^{wmin}$, $\mu_u^{wmax}$ or another composite measure.
Additionally, $\mu_C$ corresponds to some prior knowledge about the
appropriateness or accuracy of the rules. So, a probability distribution (or a
fuzzy measure) is defined over the set of rules.

Then, to combine the values of $B_i(y_0)$ taking into account $\mu_S(A)$ and
$\mu_C$, the twofold integral of $B_i(y_0)$ with respect to $\mu_S(A)$ and
$\mu_C$ will be used.

5 Conclusions and Future Work 

In this paper we have considered the interpretation of fuzzy measures and fuzzy
integrals. We have shown that the Sugeno integral is a natural extension of
the operators used in fuzzy inference systems to aggregate the outcomes of the
rules. This result permits us to give an example of the application of the twofold
integral.